1 Study materials

1.1 R data types and data structures

Read Chapter 2 of R for Data Science (2e).

Carry out the data types and data structures activities and exercises.


1.2 More work with the penguins dataset

Read Sections 1.3–1.7 of R for Data Science (2e).

Carry out the penguin dataset activities and exercises.


2 Activities

2.1 Working with R data types and data structures

Refer to the data types and data structures section and perform the following tasks.

2.1.1 Data types in R

  1. Numeric/double precision variables can store real numbers (with a decimal point). The data type of a variable can be determined using the typeof function. Check the data type of the numHydAtoms variable.

  2. Integer variables can store integers only. Assign the value -34L to a variable numStudents and check its data type.

  3. Logical variables can only take two values, TRUE or FALSE. Assign the value TRUE to a variable passed and check its data type.

  4. Character variables store text, that is, a sequence of one or more characters, also known as a string. One important consideration in working with character data is that the names of in-built R commands, variables, and functions are also written as sequences of characters. For example, we may want to store the string James in a variable called studentName. To distinguish data (also called “literal strings”) from the names of R variables or commands, literal strings are enclosed in quotes ("). Assign the literal string "James" to the variable studentName. Check the variable’s data type.

Assign James to the studentName variable without quotes. What is the error? Why?
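As a minimal sketch of the four data types described above, the following uses typeof() on a few variables. The variable names and values (boilingPoint, numSamples, isValid, elementName) are hypothetical and chosen only for illustration; they are not part of the activity.

```r
# Hypothetical variables, one for each data type discussed above.
boilingPoint <- 99.97      # a real number with a decimal point
numSamples   <- 12L        # the L suffix requests an integer
isValid      <- FALSE      # a logical value
elementName  <- "Helium"   # quoted, so it is a literal string

typeof(boilingPoint)   # "double"
typeof(numSamples)     # "integer"
typeof(isValid)        # "logical"
typeof(elementName)    # "character"

# Without the L suffix, a whole number is still stored as a double.
typeof(12)             # "double"
```

The same pattern (assign, then call typeof()) applies to each task in the list above.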

2.1.2 Data structures

  1. Very often it is necessary to deal with a large number of values. For example, we may want to store the scores of 100 students in a class. It would be very unwieldy to define and manipulate 100 different variables, one for each student’s score. R provides vectors for this purpose. A vector is a structured collection of values of the same type, that is, a data structure. The vector stores many values under one name, and each one may be accessed by its position, or index, in the vector. Vectors may be created with the combine function, c(). Create a numeric vector called student_scores with exam scores of 10 students using c(). Check the data type.

  2. The ith value of a vector may be accessed as <vector_name>[i]. A range of values from the ith to the jth position may be accessed by <vector_name>[i:j]. Print out the 6th score. Print out the 2nd through the 5th scores.

  3. It is possible to have vectors of other data types as well. Use c() to create a vector of 10 student names (A through J) called student_names. Check its data type. Print out the 2nd name and the 6th through 9th names.

  4. A very important data structure is the data.frame, which is a collection of named vectors of the same length. It takes the form of a table, with one column for each component vector and as many rows as the length of each vector. Create a data frame called student_results with two columns, names and scores, containing student names and scores respectively.

  5. There are a number of methods available to inspect data frames. nrow() and ncol() determine the number of rows and columns respectively. summary() provides an overview and head() prints out the first few rows. Use these functions on the student_results data frame to inspect it.

  6. There are many ways to access the values in a data frame. An individual element in row i and column j can be accessed by <data_frame_name>[i,j]. An entire column may be accessed by <data_frame_name>$<column_name> or <data_frame_name>[,j]. An entire row may be accessed by <data_frame_name>[i,]. Print out the 5th student’s score. Print out all the scores. Print out the name and score of the last student.
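The vector and data frame operations described above can be sketched on a small, made-up example. The data (city names and temperatures) and the object names city_temps, city_names, and weather are hypothetical and used only to illustrate the syntax.

```r
# Hypothetical data: high temperatures (degrees Celsius) for five cities.
city_temps <- c(21.5, 18.0, 30.2, 25.1, 16.4)
city_names <- c("Lima", "Oslo", "Cairo", "Rome", "Cork")

typeof(city_temps)   # "double"
typeof(city_names)   # "character"

city_temps[3]        # the 3rd value
city_names[2:4]      # the 2nd through 4th values

# Combine the two vectors into a data frame with named columns.
weather <- data.frame(city = city_names, temp = city_temps)

nrow(weather)        # number of rows
ncol(weather)        # number of columns
head(weather, 3)     # first three rows
summary(weather)     # overview of each column

weather[2, 2]        # element in row 2, column 2
weather$temp         # the entire temp column
weather[, 1]         # the entire first column
weather[4, ]         # the entire 4th row
```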


2.2 Continuing to work with the penguins dataset

Load the libraries.

library(palmerpenguins)
library(ggplot2)

2.2.1 Writing ggplot calls concisely

Refer to Section 1.3 of R for Data Science (2e).

So long as the correct order of function arguments is observed, the names of the arguments (data, mapping, etc.) may be omitted, shortening the function calls.

Therefore, the code

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

can be shortened to

ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

Note that when the argument names are omitted, changing the order of the arguments will result in an error message.
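The following sketch contrasts the two ways of passing arguments. When arguments are named, R matches them by name, so they may appear in any order; when they are unnamed, R matches them by position. The object names p_named and p_positional are hypothetical and used only to keep the two plots apart.

```r
library(palmerpenguins)
library(ggplot2)

# Named arguments are matched by name, so their order does not matter.
p_named <- ggplot(
  mapping = aes(x = flipper_length_mm, y = body_mass_g),
  data = penguins
) +
  geom_point()

# Unnamed arguments are matched by position: data first, mapping second.
p_positional <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

# Typing the name of a plot object displays it.
p_named
p_positional
```

Both calls should produce the same scatterplot. R may warn that rows with missing values were removed.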


2.2.2 Visualizing distributions of a single variable

Refer to Section 1.4 of R for Data Science (2e).

One of the first steps in data analysis is to understand the probability distribution of a random variable. For example, in the penguins dataset, the species variable can take three values: Adelie, Chinstrap, and Gentoo. The probability distribution in this case is a set of three numbers giving the probabilities of a penguin being Adelie, Chinstrap, or Gentoo. We can empirically estimate the probabilities by computing the frequency of each value of the variable. A common way to visualize the frequency distribution of categorical/factor variables is with a bar chart. Since bars are geometrical objects, we have to use one of the geom_ functions, in this case geom_bar() to plot them.

  1. Use geom_bar() to plot the frequency distribution of the species variable as a bar chart. Note that we will only specify the mapping of the species variable to the x-axis since the frequency, which is plotted on the y-axis, is not a variable in the dataset but is computed automatically by geom_bar().

The distribution of a continuous numerical variable is commonly visualized as a histogram. As one would expect, this geometrical object is created by a dedicated function called geom_histogram(). Histograms subdivide the range of values a variable takes into bins and count the number of observations that fall within each bin. geom_histogram() takes an argument bins, which specifies the number of bins the range of values is subdivided into.


  2. Use geom_histogram() to create a histogram of the penguins’ body mass (the body_mass_g variable) with 30 bins. What can you discern about the penguins’ body mass from this plot?

  3. Imagine that you keep reducing the bin width, thereby increasing the number of bins. As you do so, you will discern more and more detail in the histogram. Given enough data, in the limit of zero bin width the suitably normalized histogram approaches the probability density function. A smooth estimate of this function can be visualized with the geom_density() function. Use geom_density() to visualize the probability density function of the body mass variable. How does the density compare to the histogram above?
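The three geoms discussed above can be sketched on other variables of the penguins dataset, leaving species and body mass for the tasks themselves. The choice of island and flipper_length_mm, the value bins = 20, and the object names p_bar, p_hist, and p_dens are assumptions made only for this illustration.

```r
library(palmerpenguins)
library(ggplot2)

# Bar chart of a categorical variable: only x is mapped,
# and geom_bar() counts the rows for each island.
p_bar <- ggplot(penguins, aes(x = island)) +
  geom_bar()

# Histogram of a numerical variable, with the range split into 20 bins.
p_hist <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 20)

# Smoothed density estimate of the same numerical variable.
p_dens <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_density()

# Typing the name of a plot object displays it.
p_bar
p_hist
p_dens
```

R may warn that rows with missing flipper lengths were removed when drawing the histogram and the density.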

2.2.3 Visualizing relationships

Refer to Section 1.5 of R for Data Science (2e).

How do we visualize relationships between two or more variables? The appropriate visualization depends on the types of the variables.

2.2.3.1 A numerical and a categorical variable

The first case is when one variable is a continuous numerical variable, such as penguin body mass, and the other is a discrete, categorical variable, such as penguin species. One way to visualize such a relationship is to use box plots.
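A minimal sketch of such a box plot, using the two variables named above, might look as follows. It assumes the categorical variable is mapped to the x-axis and the numerical variable to the y-axis; the object name p_box is hypothetical.

```r
library(palmerpenguins)
library(ggplot2)

# One box per species, summarizing the distribution of body mass.
p_box <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# Typing the name of the plot object displays it.
p_box
```

Each box spans the middle half of the body mass values for one species, with a line at the median. R may warn that rows with missing body mass were removed.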