1 Study materials

1.1 R data types and data structures

Read Chapter 2 of R for Data Science (2e).

Carry out the data types and data structures activities and exercises.


1.2 More work with the penguins dataset

Read Sections 1.3–1.7 of R for Data Science (2e).

Carry out the penguin dataset activities and exercises.


2 Activities

2.1 Working with R data types and data structures

Refer to the data types and data structures section and perform the following tasks.

2.1.1 Data types in R

  1. Numeric (double-precision) variables store real numbers, that is, numbers that may have a fractional part. The data type of a variable can be determined using the typeof() function. Check the data type of the numHydAtoms variable.

  2. Integer variables can store integers only. Assign the value -34L to a variable numStudents and check its data type.

  3. Logical variables can only take two values, TRUE or FALSE. Assign the value TRUE to a variable passed and check its data type.

  4. Character variables store text. A sequence of characters is known as a string. One important consideration in working with character data is that the names of built-in R commands, variables, and functions are also made up of characters. For example, we may want to store the string James in a variable called studentName. To distinguish names that refer to R variables or commands from data (also called “literal strings”), the latter are enclosed in quotes ("). Assign the literal string "James" to the variable studentName. Check the variable’s data type.

Assign James to the studentName variable without quotes. What is the error? Why?
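
As a rough illustration of the four data types described above, the sketch below calls typeof() on one value of each kind. The variable names (mass_kg, num_courses, is_enrolled, city) are made up for this example and are not part of the activity.

```r
# Hypothetical example values, one per data type
mass_kg     <- 72.5     # double: a real number
num_courses <- 4L       # integer: the L suffix marks an integer literal
is_enrolled <- FALSE    # logical: TRUE or FALSE
city        <- "Lagos"  # character: a quoted literal string

typeof(mass_kg)      # "double"
typeof(num_courses)  # "integer"
typeof(is_enrolled)  # "logical"
typeof(city)         # "character"

# Without the L suffix, a whole number is still stored as a double
typeof(4)            # "double"
```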

2.1.2 Data structures

  1. Very often it is necessary to deal with a large number of values. For example, we may want to store the scores of 100 students in a class. It would be very unwieldy to define and manipulate 100 different variables, one for each student’s score. R provides the vector, which is a structured collection of values of the same type, or a data structure. A vector stores many values under one name, and each one may be accessed by its position, or index, in the vector. Vectors may be created with the combine function, c(). Create a numeric vector called student_scores with the exam scores of 10 students using c(). Check the data type.

  2. The ith value of a vector may be accessed as <vector_name>[i]. A range of values from the ith to the jth position may be accessed by <vector_name>[i:j]. Print out the 6th score. Print out the 2nd through the 5th scores.

  3. It is possible to have vectors of other data types as well. Use c() to create a vector of 10 student names (A through J) called student_names. Check its data type. Print out the 2nd name and the 6th through 9th names.

  4. A very important data structure is the data.frame, which is a collection of named vectors of the same length. It takes the form of a table with one column per component vector and as many rows as the length of each vector. Create a data frame called student_results with two columns, names and scores, containing student names and scores respectively.

  5. There are a number of methods available to inspect data frames. nrow() and ncol() determine the number of rows and columns respectively. summary() provides an overview and head() prints out the first few rows. Use these functions on the student_results data frame to inspect it.

  6. There are many ways to access the values in a data frame. An individual element in row i and column j can be accessed by <data_frame_name>[i,j]. An entire column may be accessed by <data_frame_name>$<column_name> or <data_frame_name>[,j]. An entire row may be accessed by <data_frame_name>[i,]. Print out the 5th student’s score. Print out all the scores. Print out the name and score of the last student.
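
The indexing rules above can be sketched on a small made-up example. The data (temperatures for five cities labelled P through T) and the names temps, cities, and city_temps are hypothetical and chosen only for illustration.

```r
# Hypothetical data: daily high temperatures (degrees C) for five cities
temps  <- c(31.5, 18.2, 24.0, 9.7, 27.3)
cities <- c("P", "Q", "R", "S", "T")

temps[3]     # third value: 24
temps[2:4]   # second through fourth values: 18.2 24.0 9.7

# Combine the two equal-length vectors into a data frame
city_temps <- data.frame(city = cities, temp = temps)

nrow(city_temps)     # 5 rows, one per city
ncol(city_temps)     # 2 columns: city and temp
head(city_temps, 3)  # first three rows

city_temps[2, 2]   # element in row 2, column 2: 18.2
city_temps$temp    # the whole temp column, by name
city_temps[, 2]    # the same column, by position
city_temps[4, ]    # the whole fourth row
```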


2.2 Continuing to work with the penguins dataset

Load the libraries.

library(palmerpenguins)
library(ggplot2)

2.2.1 Writing ggplot calls concisely

Refer to Section 1.3 of R for Data Science (2e).

So long as the correct order of function arguments is observed, the names of the arguments (data, mapping, etc.) may be omitted, shortening the function calls.

Therefore, the code

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

can be shortened to

ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

Note that changing the order of the unnamed arguments will result in an error message.
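
Matching unnamed arguments by position is a general feature of R function calls and is not specific to ggplot(). As a small sketch, the example below uses base R's seq(), whose first three arguments are from, to, and by. The variable names a, b, c1, and swapped are made up for this illustration.

```r
# Named arguments can be given in any order
a <- seq(from = 1, to = 10, by = 2)
b <- seq(by = 2, to = 10, from = 1)

# Unnamed arguments are matched by position: from, to, by
c1 <- seq(1, 10, 2)

identical(a, b)    # TRUE
identical(a, c1)   # TRUE

# Swapping unnamed arguments changes their meaning:
# here from = 10, to = 1, by = 2, which seq() rejects with an error
swapped <- tryCatch(seq(10, 1, 2), error = function(e) "error")
swapped            # "error"
```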


2.2.2 Visualizing distributions of a single variable

Refer to Section 1.4 of R for Data Science (2e).

One of the first steps in data analysis is to understand the probability distribution of a random variable. For example, in the penguins dataset, the species variable can take three values: Adelie, Chinstrap, and Gentoo. The probability distribution in this case is a set of three numbers giving the probabilities of a penguin being Adelie, Chinstrap, or Gentoo. We can empirically estimate the probabilities by computing the frequency of each value of the variable. A common way to visualize the frequency distribution of categorical/factor variables is with a bar chart. Since bars are geometrical objects, we have to use one of the geom_ functions, in this case geom_bar(), to plot them.

  1. Use geom_bar() to plot the frequency distribution of the species variable as a bar chart. Note that we will only specify the mapping of the species variable to the x-axis since the frequency, which is plotted on the y-axis, is not a variable in the dataset but is computed automatically by geom_bar().

The distribution of a continuous numerical variable is visualized as a histogram. As one would expect, this geometrical object is created by a dedicated function called geom_histogram(). Histograms subdivide the range of values a variable takes into bins and count how many observations fall within each bin. geom_histogram() takes an argument bins, which specifies the number of bins the range of values is subdivided into.


  1. Use geom_histogram() to create a histogram of the body mass of the penguins with 30 bins. What can you discern about the penguins’ body mass from this plot?

  1. Imagine that you keep reducing the bin width to increase the number of bins. As you do so, you will discern more and more detail in the histogram. Given enough data, in the limit of zero bin width the histogram (scaled to have unit area) approaches the probability density function. The geom_density() function draws a smoothed estimate of this density. Use the geom_density() function to visualize the probability density function of the body mass variable. How does the density compare to the histogram above?
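
To see the three geoms side by side without using the penguins data, here is a minimal sketch on the diamonds dataset that ships with ggplot2. The choice of dataset, the columns cut and carat, and the object names p_bar, p_hist, and p_dens are assumptions made for this illustration only.

```r
library(ggplot2)

# diamonds ships with ggplot2; cut is categorical, carat is numerical

# Bar chart: geom_bar() counts the rows in each category for the y-axis
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Histogram: the range of carat is split into 30 bins
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 30)

# Density: a smoothed estimate of the distribution of carat
p_dens <- ggplot(diamonds, aes(x = carat)) +
  geom_density()

# Printing a plot object displays it
p_bar
p_hist
p_dens
```

In both the bar chart and the histogram, the heights add up to the number of rows in the data, since every observation is counted exactly once.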

2.2.3 Visualizing relationships

Refer to Section 1.5 of R for Data Science (2e).

How do we visualize relationships between two or more variables? The appropriate visualization depends on the types of the variables.

2.2.3.1 A numerical and a categorical variable

The first case is when one variable is a continuously varying numerical variable, such as penguin body mass, and the other is a discrete, categorical variable such as penguin species. One way to visualize such a relationship is to use box plots.

Box plots summarize distributions

  1. Visualize how the distribution of body mass varies between species using the geom_boxplot() function. What can you conclude about how body mass varies with species based on this plot?

  1. Another approach could be to map the categorical variable to the color or fill aesthetic and plot the distribution of the numerical variable using geom_density(). How do these density plots compare to the ones above?

  2. We could also map the species categorical variable to the fill aesthetic. This creates the problem that parts of the Adelie and Chinstrap distributions are hidden by the Gentoo distribution. The problem can be circumvented by using transparency, which is controlled by the alpha aesthetic. A value of 1 means completely opaque, and lower values progressively increase the transparency.
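
The two approaches above can be sketched on a different dataset, again assuming the diamonds data from ggplot2, with price as the numerical variable and cut as the categorical one. The alpha value of 0.5 and the object names are arbitrary choices for this illustration.

```r
library(ggplot2)

# Box plot: one box of price per level of cut
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Overlaid densities: one curve per level of cut.
# alpha = 0.5 makes the fills semi-transparent so overlapping curves stay visible.
p_dens <- ggplot(diamonds, aes(x = price, color = cut, fill = cut)) +
  geom_density(alpha = 0.5)

p_box
p_dens
```

Note that alpha is set outside aes() here because it is a fixed value for the whole layer and is not mapped to a variable.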


2.2.3.2 Two categorical variables

Let’s say we wanted to visualize how the penguin population varies jointly with species and island. Both are categorical variables. We could map one to the x-axis aesthetic and the other to the fill aesthetic.

  1. Plot a bar plot of the penguin population on each island stratified by species. What can be concluded about the distribution of the species across islands?
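
As a sketch of the same idea on other data, the example below assumes the diamonds dataset from ggplot2 and its two categorical columns cut and clarity. The cross-tabulation with table() shows the counts that the stacked bars represent.

```r
library(ggplot2)

# Counts of diamonds for every combination of cut and clarity
counts <- table(diamonds$cut, diamonds$clarity)
counts

# One bar per cut, with each bar split into colored segments by clarity
p_counts <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar()

p_counts
```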

2.2.3.3 Two numerical variables

We have already visualized two numerical variables using a scatterplot in Module 1 Study Guide A. Scatterplots are one of the main ways of visualizing the relationship between two numerical variables.

2.2.3.4 More than two variables

Let’s say we wanted to understand how the relationship between body mass and flipper length varied by species and island. This implies that we have to visualize the relationship between four variables, two numerical (body mass and flipper length) and two categorical (species and island). One can always map the extra variables to additional aesthetics.

  1. Make a scatterplot of flipper length on x and body mass on y, species mapped to the color aesthetic, and island mapped to the shape aesthetic.

Mapping too many variables to different aesthetics can make the plot cluttered and hard to interpret. It is possible to create multiple plots, one for each value of a categorical variable, using the facet_wrap() function. Each plot is called a facet and only shows the subset of the data corresponding to one particular value of the chosen categorical variable. The first argument of facet_wrap() specifies which variable to use while faceting and takes the form ~<variable_name>.

  1. Split the scatterplot above into one facet per island using the facet_wrap() function.
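
The following sketch shows both techniques, extra aesthetics and faceting, on R's built-in mtcars dataset. The dataset, the conversion of cyl and gear to factors, and the names mt_cars, p_all, and p_facet are assumptions made for this example.

```r
library(ggplot2)

# mtcars is built into R; cyl and gear are stored as numbers,
# so convert them to factors to treat them as categorical variables
mt_cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# Four variables in one plot: two positions, color, and shape
p_all <- ggplot(mt_cars, aes(x = disp, y = hp)) +
  geom_point(aes(color = cyl, shape = gear))

# The same scatterplot split into one facet per value of gear
p_facet <- ggplot(mt_cars, aes(x = disp, y = hp)) +
  geom_point(aes(color = cyl)) +
  facet_wrap(~gear)

p_all
p_facet
```

Since gear has three distinct values in mtcars, the faceted plot has three panels, each showing only the cars with that number of gears.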

2.2.4 Saving plots

Refer to Section 1.6 of R for Data Science (2e).

So far, we have been content to display graphs within the R notebook. However, one may want to share graphs as part of presentations, research articles, emails, etc., which requires saving them as graphics files. This can be accomplished with the ggsave() function. The filename argument specifies the name of the file the plot is saved to.

  1. Use ggsave() to save the faceted plot above to the Submissions → Module 1 subdirectory of the repository in a file named <firstname>.<lastname>-savedplot1.png. Follow the instructions in Module 1 Study Guide A to add the file to the repository.
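
A minimal sketch of ggsave() is shown below. It saves an unrelated plot of the built-in faithful dataset to a temporary directory; the file name example-plot.png and the 6 by 4 inch size are arbitrary choices for this illustration and are not the path required by the activity.

```r
library(ggplot2)

# Any ggplot object can be saved; this one uses the built-in faithful dataset
p <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point()

# Hypothetical file name in a temporary directory, used only for this sketch
out_file <- file.path(tempdir(), "example-plot.png")

# width and height are in inches by default
ggsave(filename = out_file, plot = p, width = 6, height = 4)

file.exists(out_file)  # TRUE once the file has been written
```

Passing the plot explicitly through the plot argument avoids relying on the default, which is the last plot displayed.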

2.3 Additional exercises

2.3.1 R data types and structures

Before attempting the next exercise, use the Session → Load Workspace... menu option to load Exercises.RData from StudyGuides > Module 1.

  1. Insert a code chunk to determine the data type of each of the variables TT, UU, VV, WW, XX, YY, and ZZ.

  2. Why are UU and WW different types even though both have the value “FALSE”?

  3. Inspect the value of TT and VV in the Environment tab of the Environments pane. What sort of data structures are they?

  4. Print out the 21st value of TT.

  5. Print out the 54th through 63rd values of VV.

  6. Insert a code chunk to determine the number of rows and the number of columns of the mpg data frame.

  7. Use the summary() function to determine the data type of the trans column of mpg.

  8. Print out the 134th entry in the 3rd column of mpg.

  9. Print out the entire 9th column.

  10. Print out the 11th row of mpg.


2.3.2 Working with the penguins dataset

2.3.3 Writing ggplot calls concisely

  1. Run the code below. Why does it give an error? Fix the code and run it.
require(palmerpenguins)
require(ggplot2)
ggplot(
        aes(x = flipper_length_mm, y = body_mass_g),
        penguins
      ) +
geom_point()

2.3.4 Plotting distributions and relationships

  1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

  2. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

  1. Change the number of bins in the histogram of body mass to 4 and then to 200. What are the pros and cons of having fewer bins? What are the pros and cons of having more bins?

  2. Visualize the distribution of penguin populations across island and species as was done above. However, set the position argument of geom_bar() to "fill" to make the bar heights equal so that it is easy to compare proportion of different species on each island.

  3. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

  4. Make a scatterplot of hwy vs. displ using the mpg data frame. Which is the independent variable and which is the dependent variable? Map the variables to the x and y aesthetics appropriately. What can be inferred about the relationship between the two variables? Does it make sense?

  5. In the scatterplot above, map a third, numerical variable to color. How does this contrast with the results of mapping categorical variables to color? What can you infer about the interrelationship of the three variables?

  6. Next, map the third numerical variable to the size aesthetic and then to both color and size. How do these aesthetics behave differently for categorical vs. numerical variables?

  7. Map the third numerical variable to the shape aesthetic. What happens? Why?