Read Chapter 2 of R for Data Science (2e).
Carry out the data types and data structures activities and exercises.
Read Sections 1.3–1.7 of R for Data Science (2e).
Carry out the penguin dataset activities and exercises.
Refer to the data types and data structures section and perform the following tasks.
Numeric/double precision variables can store real numbers (with a decimal point). The data type of a variable can be determined using the typeof function.
Check the data type of the numHydAtoms variable.
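As a minimal sketch of the check above: the value assigned to numHydAtoms here is an assumption made only so the snippet runs on its own; use whatever value the variable holds in your session.

```r
# Assumed value for illustration; the worksheet does not state it.
numHydAtoms <- 2

# typeof() reports how R stores the value internally.
typeof(numHydAtoms)   # "double"

# A number typed without the L suffix is stored as a double,
# even when it has no fractional part.
typeof(2.5)           # "double"
```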
Integer variables can store integers only. Assign the value -34L to a variable numStudents and check its data type.
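The role of the L suffix can be seen by comparing the same number with and without it. This is only a sketch of the task above; the comparison with -34 is added for illustration.

```r
# The L suffix asks R to store the value as an integer.
numStudents <- -34L
typeof(numStudents)       # "integer"

# Without the suffix the same number is stored as a double.
typeof(-34)               # "double"

# is.integer() tests the storage type, not whether the value is whole.
is.integer(numStudents)   # TRUE
is.integer(-34)           # FALSE
```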
Logical variables can only take two values, TRUE or FALSE. Assign the value TRUE to a variable passed and check its data type.
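A short sketch of the logical type follows. The variable score_demo and the pass mark of 50 are hypothetical, added only to show that comparisons also produce logical values.

```r
passed <- TRUE
typeof(passed)            # "logical"

# Hypothetical score and pass mark, for illustration only.
score_demo <- 72
passed_demo <- score_demo >= 50

# The result of a comparison is itself a logical value.
typeof(passed_demo)       # "logical"
passed_demo               # TRUE
```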
Character variables store one or more characters. Two or more characters are known as a string. One important consideration in working with character data is that character strings are also used for in-built R commands, variables, and functions. For example, we may want to store the string James in a variable called studentName. To distinguish strings that name R variables or commands from strings that are data (also called “literal strings”), the latter are enclosed in quotes ("). Assign the literal string "James" to the variable studentName. Check the variable’s data type.
Assign James to the studentName variable without quotes. What is the error? Why?
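The difference between a quoted and an unquoted name can be sketched as follows. The snippet assumes that no object called James exists in the session; tryCatch() is used only so that the error message can be captured and printed instead of stopping the script.

```r
# Quoted: "James" is data, a literal string.
studentName <- "James"
typeof(studentName)   # "character"
nchar(studentName)    # 5

# Unquoted: R treats James as the name of an object and looks it up.
# Assumption: no object named James has been defined in this session.
msg <- tryCatch(
  {
    studentName <- James
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# The captured message names the object that R could not find.
print(msg)
```

Because the lookup fails before the assignment happens, studentName still holds "James" afterwards.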
Very often it is necessary to deal with a large number of variables. For example, we may want to store the scores of 100 students in a class. It would be very unwieldy to define and manipulate 100 different variables, one for each student’s score. R provides vectors, a data structure that stores many values under one name; each value may be accessed by its position, or index, in the vector. Vectors may be created with the combine function c(). Create a numeric vector called student_scores with the exam scores of 10 students using c(). Check the data type.
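One possible version of this vector is sketched below. The ten scores are invented for illustration; any numeric values would do.

```r
# Invented exam scores for 10 students.
student_scores <- c(78, 85.5, 62, 90, 71, 88, 94.5, 55, 67, 80)

# A vector has a single type shared by all of its elements.
typeof(student_scores)   # "double"

# length() gives the number of elements stored in the vector.
length(student_scores)   # 10
```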
The i-th value of a vector may be accessed as <vector_name>[i]. A range of values from the i-th to the j-th position may be accessed by <vector_name>[i:j]. Print out the 6th score. Print out the 2nd through the 5th scores.
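Indexing can be sketched on an invented vector of scores (the values are assumptions, repeated here so the snippet runs on its own). Note that R counts positions from 1.

```r
# Invented exam scores for 10 students.
student_scores <- c(78, 85.5, 62, 90, 71, 88, 94.5, 55, 67, 80)

# A single position.
student_scores[6]     # 88

# A range of positions; 2:5 expands to 2, 3, 4, 5.
student_scores[2:5]   # 85.5 62.0 90.0 71.0
```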
It is possible to have vectors of other data types as well. Use c() to create a vector of 10 student names (A through J) called student_names. Check its data type. Print out the 2nd name and the 6th through 9th names.
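A character vector works the same way as a numeric one. The sketch below also compares the result with LETTERS, a constant built into R, which is an optional shortcut and not something the task requires.

```r
# Each name is a literal string, so each is quoted.
student_names <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
typeof(student_names)   # "character"

student_names[2]        # "B"
student_names[6:9]      # "F" "G" "H" "I"

# LETTERS holds the 26 upper-case letters; its first 10 match the vector above.
identical(student_names, LETTERS[1:10])   # TRUE
```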
A very important data structure is the data.frame, which is a collection of named vectors of the same length. It takes the form of a table with one column per component vector and as many rows as the length of the vectors. Create a data frame called student_results with two columns, names and scores, containing student names and scores respectively.
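A data frame can be assembled from two vectors of equal length, as in this sketch. The scores are invented for illustration.

```r
# Two vectors of the same length (invented scores).
student_names  <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
student_scores <- c(78, 85.5, 62, 90, 71, 88, 94.5, 55, 67, 80)

# Each argument of data.frame() becomes a named column.
student_results <- data.frame(
  names  = student_names,
  scores = student_scores
)

# str() shows the column names, their types, and the first few values.
str(student_results)
```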
There are a number of methods available to inspect data frames. nrow() and ncol() determine the number of rows and columns respectively. summary() provides an overview and head() prints out the first few rows. Use these functions on the student_results data frame to inspect it.
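The inspection functions can be tried on a small data frame like the one below, which is rebuilt here with invented scores so the snippet is self-contained.

```r
# Invented data, for illustration only.
student_results <- data.frame(
  names  = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  scores = c(78, 85.5, 62, 90, 71, 88, 94.5, 55, 67, 80)
)

nrow(student_results)      # 10
ncol(student_results)      # 2

# Per-column overview: quartiles and mean for numeric columns.
summary(student_results)

# head() shows the first 6 rows by default; n changes that.
head(student_results)
head(student_results, n = 3)
```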
There are many ways to access the values in a data frame. An individual element in row i and column j can be accessed by <data_frame_name>[i,j]. An entire column may be accessed by <data_frame_name>$<column_name> or <data_frame_name>[,j]. An entire row may be accessed by <data_frame_name>[i,]. Print out the 5th student’s score. Print out all the scores. Print out the name and score of the last student.
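The access patterns above can be sketched on an invented data frame. Using nrow() to find the last row is one possible approach; it avoids hard-coding the number of students.

```r
# Invented data, for illustration only.
student_results <- data.frame(
  names  = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  scores = c(78, 85.5, 62, 90, 71, 88, 94.5, 55, 67, 80)
)

# Row 5, column 2: the 5th student's score.
student_results[5, 2]        # 71

# A whole column, by name or by position.
student_results$scores
student_results[, 2]

# A whole row; nrow() gives the index of the last one.
last_row <- student_results[nrow(student_results), ]
last_row
```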
Load the libraries.
library(palmerpenguins)
library(ggplot2)
ggplot calls concisely
Refer to Section 1.3 of R for Data Science (2e).
So long as the correct order of function arguments is observed, the names of the arguments (mapping, data, etc.) may be omitted, shortening the function calls.
Therefore, the code
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
can be shortened to
ggplot(
penguins,
aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
Note that changing the order will result in an error message.
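The way R matches arguments by position or by name can be sketched with a small function. scale_score() is a hypothetical helper written only for this illustration and is not part of ggplot2. In this toy case swapping unnamed arguments silently changes the result, whereas ggplot() reports an error as noted above.

```r
# Hypothetical helper: converts a raw score to a percentage.
scale_score <- function(score, max_score) {
  100 * score / max_score
}

# Named arguments: the order does not matter.
named         <- scale_score(score = 45, max_score = 50)
named_swapped <- scale_score(max_score = 50, score = 45)

# Unnamed arguments are matched by position, so the order matters.
positional <- scale_score(45, 50)
swapped    <- scale_score(50, 45)

named == positional      # TRUE
named == named_swapped   # TRUE
named == swapped         # FALSE
```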
One of the first steps in data analysis is to understand the
probability distribution of a random variable. For example, in
the penguins dataset, the species variable can take three values:
Adelie, Chinstrap, and Gentoo. The probability distribution in this case
is a set of three numbers giving the probabilities of a penguin being
Adelie, Chinstrap, or Gentoo. We can empirically estimate the
probabilities by computing the frequency of each value of the variable.
A common way to visualize the frequency distribution of
categorical/factor variables is with a bar chart.
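The empirical estimate described above can be sketched in base R with table() and prop.table(). The vector species_demo is a small invented sample, not the penguins data; with the real dataset one would pass penguins$species instead.

```r
# Invented sample of 8 species labels, for illustration only.
species_demo <- c("Adelie", "Gentoo", "Adelie", "Chinstrap",
                  "Adelie", "Gentoo", "Adelie", "Chinstrap")

# Frequency of each value of the variable.
counts <- table(species_demo)
counts

# Relative frequencies: the empirical estimate of the probabilities.
probs <- prop.table(counts)
probs

# The estimated probabilities sum to 1.
sum(probs)
```

These counts are what a bar chart draws as bar heights.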
Since bars are geometrical objects, we have to use one of the geom_ functions, in this case geom_bar(), to plot them.
Use geom_bar() to plot the frequency distribution of the species variable as a bar chart. Note that we will only specify the mapping of the species variable to the x-axis since the frequency, which is plotted on the y-axis, is not a variable in the dataset but is computed automatically by geom_bar().
The distribution of a continuously varying numerical variable is visualized as a histogram. As one would expect, this geometrical object is created by a dedicated function called geom_histogram(). Histograms subdivide the range of values a variable takes into bins and count the frequency of occurrence of the variable within each bin. geom_histogram() takes an argument bins, which specifies the number of bins the range of values is subdivided into.
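The binning idea can be sketched by hand in base R. The values in mass_demo are invented, and the equal-width breaks built with seq() are an assumption for illustration; geom_histogram() places its bin edges by its own rules, so its counts may differ slightly at the boundaries.

```r
# Invented body masses in grams, for illustration only.
mass_demo <- c(3200, 3450, 3700, 3750, 3900, 4100, 4300, 4750, 5200, 5600)

# Split the range of the data into n_bins intervals of equal width.
n_bins <- 4
breaks <- seq(min(mass_demo), max(mass_demo), length.out = n_bins + 1)

# cut() assigns each value to its bin; include.lowest keeps the minimum.
bin <- cut(mass_demo, breaks = breaks, include.lowest = TRUE)

# Counting the values per bin gives the heights of the histogram bars.
bin_counts <- table(bin)
bin_counts
```

Changing n_bins shows how the apparent shape of the distribution depends on the number of bins.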
Use geom_histogram() to create a histogram of the body mass of the penguins with 30 bins. What can you discern about the penguins’ body mass from this plot?
geom_density() function.
Use the geom_density() function to visualize the probability density function of the body mass variable. How does the density compare to the histogram above?
Refer to Section 1.5 of R for Data Science (2e).
How do we visualize relationships between two or more variables? The appropriate visualization depends on the types of the variables involved.
The first case is when one variable is a continuously varying numerical variable, such as penguin body mass, and the other is a discrete, categorical variable such as penguin species. One way to visualize such a relationship is to use box plots.
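A box plot of body mass by species might be drawn as in the sketch below, which assumes the palmerpenguins and ggplot2 packages are installed. The categorical variable is mapped to the x-axis and the numerical variable to the y-axis; rows with a missing body mass are dropped with a warning.

```r
library(palmerpenguins)
library(ggplot2)

# One box per species, summarising the distribution of body mass.
p <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# Typing the object's name at the console draws the plot.
p
```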