1 Study materials

1.1 Data transformation

Read Chapter 3 of R for Data Science (2e).

1.1.1 Introduction to data transformation

1.1.2 Row verbs

1.1.3 Column verbs

1.1.4 Group verbs

Carry out the data transformation activities and exercises.


2 Activities

2.1 Data transformation

So far we have visualized or analyzed data as they were provided to us. But very often we may want to change or transform the data in some way before analyzing or visualizing it. We saw one example of this in Module 2 Study Guide B when we removed penguins with missing data which were making the interpretation of plots difficult. Other kinds of transformation might be to compute quantities from the raw data, for example, one may want to understand how flower proportions, given by the ratio of petal width and length vary by species in the iris dataset. Or, one may want to summarize the raw data. For example we may compute the means and standard deviations of highway mileage for each class of vehicle in the mpg dataset to gain insight into which class of vehicles has the best mileage.

2.1.1 Load libraries

We will be using the dplyr library, which along with ggplot and other libraries is part of tidyverse, for data transformation. More on tidyverse later. In this section, we will also follow the book and use the nycflights13 library, whose flights data frame contains all 336,776 flights that departed from New York City in 2013. Let us load these libraries.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ lubridate::hms() masks vembedr::hms()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)

Note the message about conflicts above. This is because dplyr etc are redefining or masking functions in other packages. If one still wants to use masked version, one call it with its full name, stats::filter() for example.o

2.1.2 Introduction

  1. flights is a tibble, which is similar to a data frame. One way to look at tibbles is glimpse(), which prints each column’s data type along with a small number of datapoints. Use glimpse() to view flights. What do the rows represent? How many factor/categorical variables does it have? How many numeric/quantitative?

  2. dplyr provides functions for transforming tibbles. For example we can utilize the filter function to subset rows matching a condition. Use filter to find all the flights arriving in IAH (Houston area) airports and assign to a variable. How many flights are there?

  3. Another type of transformation is computing statistical measures such as means and standard deviation. Let’s say we wanted to identify whether any particular days of the year were particularly bad for flight delays at IAH. We would want to identify each combination of day, month, and year and compute the average arrival delay across all flights, irrespective of the values of the other factors such as origin, carrier etc. This can be accomplished in two steps. First, use the group_by() function to create each unique combination of year, month, and day and assign to a new variable. Inspect the new data frame. What is different about it?

  4. Use the summarize() function to compute the mean of the arr_delay variable in a column called arr_delay_mean and assign to a new variable. Inspect the new data frame and determine what has changed.

  5. Given that these functions are operating sequentially on the same tibble, it is inefficient and unnecessary to keep creating new variables and passing them on to the next function. To accomplish these tasks in a much more concise and easy-to-read manner dplyr provides the pipe |> operator. When using the pipe, the first argument of these functions, the tibble, is omitted and only the final result need be assigned to a variable. Use the |> to carry out the steps above in the single statement and assign the result to a variable.


Functions in dplyr are called verbs (since they perform actions). There are a number of verbs and they can be classified based on whether they operate on row, columns, groups, or entire tables.

2.1.3 Row verbs

  1. filter keeps rows that match values of columns or combinations constructed with Boolean operators. Filter the flights dataset to identify flights leaving in the first quarter of the year. How many were there?

  2. It can be cumbersome to test whether a variable matches one of several options by combining the == and | operators, so R provides a special operator %in%. It returns TRUE if the variable is equal to one of the elements of a vector and FALSE if the variable is not equal any of the elements of a vector. Rewrite the filtering statement above using the %in% operator.

  3. arrange() changes the order of the rows based on the values of the columns. Given a column name, it sorts the rows in ascending order. The order can be switched to descending by enclosing the column name in desc(). If multiple columns are provided, then a nested sort is performed. The rows are sorted in ascending or descending order of the first column. Then ties in first column are resolved by sorting them by the values of the second columns and so on. Identify the earliest flight in this dataset by sorting according to year, month, day, and departure time. Which flight was the first to leave in the year?

  4. Find the last flight to leave on January 1, 2013.

  5. You can also sort according to a function that computes some quantity of interest from the variables. For example, we may be interested in which flights made up the departure delay en route. Determine flights that made up the departure delay by sorting the rows on the difference between the arrival and departure delays. Were there any flights that arrived early despite leaving late?

  6. distinct() finds unique values of columns or of combinations of columns. Find how many different airports did flights arrive to.

  7. Find how many origin/destination pairs are represented in the dataset.

  8. By default, the other columns are not returned, but can be retained with the .keep_all argument. Repeat the above but keep all the columns.

  9. Using count() one can determine how many rows in the dataset match the particular combination. If the sort argument is TRUE then the resulting tibble will be sorted in descending order. Find out which origin/destination pair had the most flights.

2.1.4 Column verbs

  1. Sometimes we would like to compute new things from the variables present in the dataset. This can be accomplished with the mutate() verb that adds new columns to the tibble containing the new computed quantities. Add two new columns to the tibble, one that computes the time gained in the flight and another that computes the speed of the plane in miles per hour.

  2. The columns can be placed at a particular position using the .before or .after arguments. One can also specify a column name instead of a position. Place the new columns before the 8th column. Place the new columns after the distance column.

  3. Sometimes datasets may have hundreds or thousands of variables, making it a challenge even to perform initial exploration. In such situations, the select() function can subset select columns.

  • Select just the origin and dep_delay columns.

  • Select a range of columns from year to day.

  • Invert the above selection.

  • Select only the columns with character data.

  1. Rename the tailnum column to tail_num.

  2. Relocate the time_hour column to the front. Move the columns containing flight time information, from air_time to time_hour to the position before carrier.


2.1.5 Group verbs

  1. As discussed in the introduction to dplyr at the start, one can create groups using the group_by() verb. This allows the computation of statistical summaries of the observations belonging to each group. For example, let’s say we wanted to see if there was a particular month on the year that was particularly bad for light delays, one could group by month and computer the mean departure and arrival delays. Group the obervation by month and computer the mean departure and arrival delays in each month. Make sure to use the na.rm argument to the mean() function to remove NA from the calculation.

  2. It is possible to add multiple statistical summaries. For example, we may wonder whether these differences are real or due to random sampling. We could get an idea by determining how many observations each group has; the greater the observations, the less likely that the differences are due to sampling. Use the n() function to add a column with the number of observations in each group. We could also determine the standard deviations to get an idea whether the means are very different or not. Add a third column for standard deviations. Do these differences appear to be significant?

  3. When making groups, it is useful to inspect some of the observations from each group to understand differences between them. For this purpose, dplyr provides the slice_head(n = <numrows>), slice_tail(n = <numrows>), and slice_sample(n = <numrows>) functions which return the first, last, and a random sample of <numrows> respectively. slice_min(x, n = <numrows>) and slice_max(x, n = <numrows>) return the rows with the smallest or largestvalues of columnx` respectively. Inspect the first two rows and two random rows of each month in the dataset.

  4. Providing multiple variables to the group_by() command leads to the creation of groups corresponding to all the different combinations of the values of the three variables. The resulting data frame changes the groups to exclude the last variable. Use .groups = "keep" to preserve the groups or .groups = "drop" to remove all grouping. Group the year, month, and day variables, determine the average departure delays and the number of flights on each day and find which day had the highest delay.


2.2 Additional exercises

2.2.1 Data transformation

2.2.1.1 Row verbs

  1. In a single pipeline for each condition, find all flights that meet the condition:
  • Had an arrival delay of two or more hours
  • Flew to Houston (IAH or HOU)
  • Were operated by United, American, or Delta
  • Departed in summer (July, August, and September)
  • Arrived more than two hours late but didn’t leave late
  • Were delayed by at least an hour, but made up over 30 minutes in flight
  1. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

  2. Was there a flight on every day of 2013?

2.2.1.2 Column verbs

  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related? Check whether this is true. Are there any discrepancies? Why?

  2. There are helper functions starts_with(), ends_with() and contains() that take string arguments and return the columns that start with, end with, or contain the string. Use one of these functions to select both the dep_time and dep_delay columns without explicitly specifying the column names.

  3. Why doesn’t the following work, and what does the error mean?

  4. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

2.2.1.3 Group verbs

  1. Which carrier has the worst average delays?

  2. Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about small carriers who might not fly to very many airports. Determine how many airports each carrier flies from.)

  3. Find the flights that are most delayed upon departure from each origin.

  4. Find the flights that are most delayed on average upon departure from each origin. Can we conclude that one is more likely to encounter a delay on these flights?

  5. How do departure delays vary over the course of the day? Illustrate your answer with a plot.


End of Module 3 Study Guide A