Read Chapter 2 of R for Data Science (2e).
Carry out the data types and data structures activities and exercises.
Read Sections 1.3–1.7 of R for Data Science (2e).
Carry out the penguin dataset activities and exercises.
Refer to the data types and data structures section and perform the following tasks.
Numeric/double precision variables can store real numbers (with a decimal point). The data type of a variable can be determined using the typeof() function.
Check the data type of the numHydAtoms variable.
Integer variables can store integers only. Assign the value -34L to a variable numStudents and check its data type.
Logical variables can only take two values, TRUE or FALSE. Assign the value TRUE to a variable passed and check its data type.
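The three basic types described above can be compared side by side with typeof(). The following is a minimal sketch; the variable names and values (bond_length, n_samples, is_valid) are hypothetical and are not taken from the exercises.

```r
# Hypothetical example values, chosen only for illustration.
bond_length <- 1.54   # a number with a decimal point is stored as a double
n_samples   <- 12L    # the L suffix asks R to store an integer
is_valid    <- FALSE  # logical values are TRUE or FALSE

typeof(bond_length)   # "double"
typeof(n_samples)     # "integer"
typeof(is_valid)      # "logical"

# Without the L suffix, a whole number is still stored as a double.
typeof(12)            # "double"
```

Note that typeof() reports "double" for numeric values, which is why the L suffix matters when an integer is wanted.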
Character variables store one or more characters. Two or more characters are known as a string. One important consideration in working with character data is that sequences of characters are also used as the names of built-in R commands, variables, and functions. For example, we may want to store the string James in a variable called studentName. To distinguish the names of R variables and commands from data (also called “literal strings”), the latter are enclosed in quotes ("). Assign the literal string "James" to the variable studentName. Check the variable’s data type.
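The difference between a quoted literal string and an unquoted name can be sketched as follows. The variables city, home_town, and label are hypothetical names used only for illustration.

```r
city <- "Lisbon"         # quoted: a literal string stored in the variable city
typeof(city)             # "character"

home_town <- city        # unquoted: R looks up the variable city and copies its value
home_town                # "Lisbon"

label <- "city"          # quoted again: just the four characters c, i, t, y
identical(label, city)   # FALSE, because "city" and "Lisbon" are different strings
```

An unquoted name is always treated as something R should look up, so it must already exist when it is used.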
Assign James to the studentName variable without quotes. What is the error? Why?
Very often it is necessary to deal with a large number of variables. For example, we may want to store the scores of 100 students in a class. It would be very unwieldy to have to define and manipulate 100 different variables, one for each student’s score. R provides vectors, which are structured collections of variables, or data structures. A vector stores many variables under one name, and each one may be accessed by its position, or index, in the vector. Vectors may be created with the combine function, c(). Create a numeric vector called student_scores with the exam scores of 10 students using c(). Check the data type.
The i-th value of a vector may be accessed as <vector_name>[i]. A range of values from the i-th to the j-th position may be accessed by <vector_name>[i:j]. Print out the 6th score. Print out the 2nd through the 5th scores.
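Vector creation and indexing can be illustrated with a short sketch. The vector temps_c and its values are hypothetical; R indexes vectors starting at 1.

```r
# A hypothetical numeric vector of seven daily temperatures.
temps_c <- c(21.5, 19.0, 23.2, 25.1, 18.4, 20.0, 22.7)

typeof(temps_c)   # "double"
length(temps_c)   # 7

temps_c[3]        # the 3rd value: 23.2
temps_c[2:4]      # the 2nd through 4th values: 19.0 23.2 25.1
```

The expression 2:4 is itself a vector of positions (2, 3, 4), which is why it selects a range.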
It is possible to have vectors of other data types as well. Use c() to create a vector of 10 student names (A through J) called student_names. Check its data type. Print out the 2nd name and the 6th through 9th names.
A very important data structure is the data.frame, which is a collection of named vectors of the same length. It takes the form of a table, with one column per component vector and as many rows as the length of the vectors. Create a data frame called student_results with two columns, names and scores, containing student names and scores respectively.
There are a number of methods available to inspect data frames. nrow() and ncol() determine the number of rows and columns respectively. summary() provides an overview and head() prints out the first few rows. Use these functions on the student_results data frame to inspect it.
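Creating and inspecting a data frame can be sketched as follows. The data frame city_temps, its column names, and its values are hypothetical and serve only as an illustration.

```r
# Two hypothetical vectors of the same length.
cities  <- c("Lisbon", "Oslo", "Cairo", "Lima")
highs_c <- c(28.1, 21.4, 35.0, 19.3)

# Each named argument becomes one column of the table.
city_temps <- data.frame(city = cities, high_c = highs_c)

nrow(city_temps)          # 4 rows, one per element of the vectors
ncol(city_temps)          # 2 columns, one per vector
head(city_temps, n = 2)   # the first two rows
summary(city_temps)       # a per-column overview
```

By default head() prints the first six rows; the n argument changes that number.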
There are many ways to access the values in a data frame. An individual element in row i and column j can be accessed by <data_frame_name>[i,j]. An entire column may be accessed by <data_frame_name>$<column_name> or <data_frame_name>[,j]. An entire row may be accessed by <data_frame_name>[i,]. Print out the 5th student’s score. Print out all the scores. Print out the name and score of the last student.
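The access patterns above can be sketched on a small data frame. The data frame city_temps and its contents are hypothetical; only the indexing syntax comes from the text.

```r
# A hypothetical data frame with four rows and two columns.
city_temps <- data.frame(
  city   = c("Lisbon", "Oslo", "Cairo", "Lima"),
  high_c = c(28.1, 21.4, 35.0, 19.3)
)

city_temps[3, 2]                 # row 3, column 2: 35
city_temps$high_c                # an entire column, selected by name
city_temps[, 2]                  # the same column, selected by position
city_temps[2, ]                  # an entire row (a one-row data frame)
city_temps[nrow(city_temps), ]   # the last row, whatever the number of rows
```

Selecting a column returns a vector, whereas selecting a row returns a data frame, because the columns of a row may have different types.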
Load the libraries.
library(palmerpenguins)
library(ggplot2)
ggplot calls concisely
Refer to Section 1.3 of R for Data Science (2e).
So long as the correct order of function arguments is observed, the names of the arguments (data, mapping, etc.) may be omitted, shortening the function calls.
Therefore, the code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
can be shortened to
ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
Note that changing the order will result in an error message.
One of the first steps in data analysis is to understand the
probability distribution of a random variable. For example, in
the penguins dataset, the species variable can take three values:
Adelie, Chinstrap, and Gentoo. The probability distribution in this case
is a set of three numbers giving the probabilities of a penguin being
Adelie, Chinstrap, or Gentoo. We can empirically estimate the
probabilities by computing the frequency of each value of the variable.
A common way to visualize the frequency distribution of
categorical/factor variables is with a bar chart.
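The empirical estimate described above can be computed directly in base R before any plotting. This is a minimal sketch on a hypothetical character vector (drive); table() counts each distinct value and prop.table() converts the counts to proportions.

```r
# A hypothetical categorical variable with three possible values.
drive <- c("front", "rear", "front", "four", "front", "four", "rear", "front")

counts <- table(drive)       # frequency of each value
counts

props <- prop.table(counts)  # counts divided by the total number of observations
props

sum(props)                   # the estimated probabilities add up to 1
```

The proportions are only estimates of the underlying probabilities; they become more reliable as the number of observations grows.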
Since bars are geometrical objects, we have to use one of the geom_ functions, in this case geom_bar(), to plot them.
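As an illustration of the general pattern, the sketch below draws a bar chart of a categorical variable from a different dataset: the drv column of mpg, which ships with ggplot2. The choice of dataset and the name p_bar are assumptions made for this example.

```r
library(ggplot2)

# Only x is mapped; geom_bar() counts the rows for each value of drv
# and uses those counts as the bar heights.
p_bar <- ggplot(mpg, aes(x = drv)) +
  geom_bar()

p_bar
```

Storing the plot in a variable and then printing it is optional; typing the ggplot() call on its own displays the plot as well.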
Use geom_bar() to plot the frequency distribution of the species variable as a bar chart. Note that we only specify the mapping of the species variable to the x-axis, since the frequency, which is plotted on the y-axis, is not a variable in the dataset but is computed automatically by geom_bar().
The distribution of a continuously varying numerical variable is visualized as a histogram. As one would expect, this geometrical object is created by a dedicated function called geom_histogram(). Histograms subdivide the range of values a variable takes into bins and count the number of observations that fall within each bin. geom_histogram() takes an argument, bins, which specifies the number of bins the range of values is subdivided into.
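The effect of the bins argument can be sketched on a different numerical variable: hwy (highway miles per gallon) from the mpg dataset bundled with ggplot2. The dataset, the bin count of 20, and the name p_hist are assumptions made for this example.

```r
library(ggplot2)

# The range of hwy is split into 20 bins; the bar heights are the
# number of observations falling into each bin.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

p_hist
```

Every observation falls into exactly one bin, so the bar heights always add up to the number of rows with a non-missing value.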
Use geom_histogram() to create a histogram of the body mass of the penguins with 30 bins. What can you discern about the penguins’ body mass from this plot?
Use the geom_density() function to visualize the probability density function of the body mass variable. How does the density compare to the histogram above?
Refer to Section 1.5 of R for Data Science (2e).
How do we visualize relationships between two or more variables? The appropriate visualization depends on the types of the two variables.
The first case is when one variable is a continuously varying numerical variable, such as penguin body mass, and the other is a discrete, categorical variable, such as penguin species. One way to visualize such a relationship is to use box plots.
Use the geom_boxplot() function to visualize how body mass varies with species. What can you conclude about how body mass varies with species based on this plot?
Another approach could be to map the categorical variable to the color or fill aesthetic and plot the distribution of the numerical variable using geom_density(). How do these density plots compare to the ones above?
We could also map the species categorical variable to the fill aesthetic. This creates the problem that parts of the Adelie and Chinstrap distributions are hidden by the Gentoo distribution. The problem can be circumvented by using transparency, which is controlled by the alpha aesthetic. A value of 1 means completely opaque and lower values progressively increase the transparency.
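The use of transparency can be sketched on a different dataset: the hwy and drv columns of mpg, bundled with ggplot2. The dataset, the alpha value of 0.5, and the name p_dens are assumptions made for this example.

```r
library(ggplot2)

# drv is mapped to fill inside aes(), so each drive type gets its own
# filled density curve. alpha is set outside aes() because it is a
# fixed value here, not something that varies with the data.
p_dens <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)

p_dens
```

With alpha = 0.5 the overlapping regions remain visible, which makes it easier to compare the curves where they cross.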
Let’s say we wanted to visualize how the penguin population varies jointly with species and island. Both are categorical variables. We could map one to the x-axis aesthetic and the other to the fill aesthetic.
We have already visualized two numerical variables using a scatterplot in Module 1 Study Guide A. Scatterplots are one of the main ways of visualizing the relationship between two numerical variables.
Let’s say we wanted to understand how the relationship between body mass and flipper length varied by species and island. This implies that we have to visualize the relationship between four variables, two numerical (body mass and flipper length) and two categorical (species and island). One can always map the extra variables to additional aesthetics.
Create a scatterplot of body mass and flipper length, with species mapped to the color aesthetic and island mapped to the shape aesthetic.
Mapping too many variables to different aesthetics can make the plot cluttered and hard to interpret. It is possible to create multiple plots, one for each value of a categorical variable, using the facet_wrap() function. Each plot is called a facet and only shows the subset of the data corresponding to one particular value of the chosen categorical variable. The first argument of facet_wrap() specifies which variable to use for faceting and takes the form ~<variable_name>.
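Faceting can be sketched on a different dataset: a scatterplot of hwy against displ from mpg (bundled with ggplot2), split by drv. The dataset, the choice of variables, and the name p_facet are assumptions made for this example.

```r
library(ggplot2)

# One panel per value of drv; each panel shows only the rows
# of mpg that have that drive type.
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~drv)

p_facet
```

By default all facets share the same axis ranges, so the panels can be compared directly.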
Create a faceted plot using the facet_wrap() function.
Refer to Section 1.6 of R for Data Science (2e).
So far, we have been content to display graphs within the R notebook. However, one may want to share graphs as part of presentations, research articles, emails, etc., and would therefore like to save them as graphics files. This can be accomplished with the ggsave() function. The filename argument specifies the name of the file the plot is saved to.
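A call to ggsave() can be sketched as follows. The plot, the file name example-plot.png, and the use of a temporary directory are assumptions made for this example and are not the submission path required by the exercise.

```r
library(ggplot2)

# A simple plot to save, built from the mpg dataset bundled with ggplot2.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output location in a temporary directory.
out_file <- file.path(tempdir(), "example-plot.png")

# The file extension determines the graphics format;
# width and height are in inches by default.
ggsave(filename = out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

If the plot argument is omitted, ggsave() saves the most recently displayed plot.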
Use ggsave() to save the faceted plot above in the Submissions → Module 1 subdirectory of the repository, in a file named <firstname>.<lastname>-savedplot1.png. Follow the instructions in Module 1 Study Guide A to add the file to the repository.
Before attempting the next exercise, use the Session → Load Workspace... menu option to load Exercises.RData from StudyGuides > Module 1.
Insert a code chunk to determine the data types of the variables TT, UU, VV, WW, XX, YY, and ZZ.
Why are UU and WW different types even though both have the value “FALSE”?
Inspect the values of TT and VV in the Environment tab of the Environments pane. What sort of data structures are they?
Print out the 21st value of TT.
Print out the 54th through 63rd values of VV.
Insert a code chunk to determine the number of rows and the number of columns of the mpg data frame.
Use the summary() function to determine the data type of the trans column of mpg.
Print out the 134th entry in the 3rd column of mpg.
Print out the entire 9th column.
Print out the 11th row of mpg.
ggplot calls concisely
require(palmerpenguins)
require(ggplot2)
ggplot(
  aes(x = flipper_length_mm, y = body_mass_g),
  penguins
) +
  geom_point()
Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?
ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")
ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")
Change the number of bins in the histogram of body mass to 4 and then to 200. What are the pros and cons of having fewer bins? What are the pros and cons of having more bins?
Visualize the distribution of penguin populations across island and species as was done above. However, set the position argument of geom_bar() to "fill" to make the bar heights equal, so that it is easy to compare the proportions of the different species on each island.
The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
Make a scatterplot of hwy vs. displ using the mpg data frame. Which is the independent variable and which is the dependent variable? Map the variables to the x and y aesthetics appropriately. What can be inferred about the relationship between the two variables? Does it make sense?
In the scatterplot above, map a third, numerical variable to color. How does this contrast with the results of mapping categorical variables to color? What can you infer about the interrelationship of the three variables?
Next, map the third numerical variable to the size aesthetic and then to both color and size. How do these aesthetics behave differently for categorical vs. numerical variables?
Map the third numerical variable to the shape aesthetic. What happens? Why?