Read Chapter 3 of R for Data Science (2e).
So far we have visualized or analyzed
data as they were provided to us. But very often we may want to change
or transform the data in some way before analyzing or
visualizing it. We saw one example of this in Module 2 Study Guide B
when we removed penguins with missing data which were making the
interpretation of plots difficult. Other kinds of transformation might
be to compute quantities from the raw data, for example, one may want to
understand how flower proportions, given by the ratio of petal width and
length vary by species in the iris
dataset. Or, one may
want to summarize the raw data. For example we may compute the
means and standard deviations of highway mileage for each class of
vehicle in the mpg
dataset to gain insight into which class
of vehicles has the best mileage.
We will be using the
dplyr
library, which along with ggplot
and
other libraries is part of tidyverse
, for data
transformation. More on tidyverse
later. In this section,
we will also follow the book and use the nycflights13
library, whose flights
data frame contains all 336,776
flights that departed from New York City in 2013. Let us load these
libraries.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::hms() masks vembedr::hms()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
Note the message about conflicts
above. This is because dplyr
etc are redefining or
masking functions in other packages. If one still wants to use
masked version, one call it with its full name,
stats::filter()
for example.o
flights
is a
tibble, which is similar to a data frame. One way to look at
tibbles is glimpse()
, which prints each column’s
data type along with a small number of datapoints. Use
glimpse()
to view flights. What do the rows represent? How
many factor/categorical variables does it have? How many
numeric/quantitative?
dplyr
provides
functions for transforming tibbles. For example we can utilize the
filter
function to subset rows matching a condition. Use
filter
to find all the flights arriving in IAH (Houston
area) airports and assign to a variable. How many flights are
there?
Another type of transformation is
computing statistical measures such as means and standard deviation.
Let’s say we wanted to identify whether any particular days of the year
were particularly bad for flight delays at IAH. We would want to
identify each combination of day, month, and year and compute the
average arrival delay across all flights, irrespective of the values of
the other factors such as origin, carrier etc. This can be accomplished
in two steps. First, use the group_by()
function to create
each unique combination of year, month, and day and assign to a new
variable. Inspect the new data frame. What is different about
it?
Use the summarize()
function to compute the mean of the arr_delay
variable in a
column called arr_delay_mean
and assign to a new variable.
Inspect the new data frame and determine what has
changed.
Given that these functions are
operating sequentially on the same tibble, it is inefficient and
unnecessary to keep creating new variables and passing them on to the
next function. To accomplish these tasks in a much more concise and
easy-to-read manner dplyr
provides the pipe
|>
operator. When using the pipe, the first argument of
these functions, the tibble, is omitted and only the final result need
be assigned to a variable. Use the |>
to carry out the
steps above in the single statement and assign the result to a
variable.
Functions in dplyr
are
called verbs (since they perform actions). There are a number
of verbs and they can be classified based on whether they operate on
row, columns, groups, or entire tables.
filter
keeps rows
that match values of columns or combinations constructed with Boolean
operators. Filter the flights dataset to identify flights leaving in the
first quarter of the year. How many were there?
It can be cumbersome to test
whether a variable matches one of several options by combining the
==
and |
operators, so R provides a special
operator %in%
. It returns TRUE
if the variable
is equal to one of the elements of a vector and FALSE
if
the variable is not equal any of the elements of a vector. Rewrite the
filtering statement above using the %in%
operator.
arrange()
changes
the order of the rows based on the values of the columns. Given a column
name, it sorts the rows in ascending order. The order can be switched to
descending by enclosing the column name in desc()
. If
multiple columns are provided, then a nested sort is performed. The rows
are sorted in ascending or descending order of the first column. Then
ties in first column are resolved by sorting them by the values of the
second columns and so on. Identify the earliest flight in this dataset
by sorting according to year, month, day, and departure time. Which
flight was the first to leave in the year?
Find the last flight to leave on January 1, 2013.
You can also sort according to a function that computes some quantity of interest from the variables. For example, we may be interested in which flights made up the departure delay en route. Determine flights that made up the departure delay by sorting the rows on the difference between the arrival and departure delays. Were there any flights that arrived early despite leaving late?
distinct()
finds
unique values of columns or of combinations of columns. Find how many
different airports did flights arrive to.
Find how many origin/destination pairs are represented in the dataset.
By default, the other columns are
not returned, but can be retained with the .keep_all
argument. Repeat the above but keep all the columns.
Using count()
one
can determine how many rows in the dataset match the particular
combination. If the sort
argument is TRUE
then
the resulting tibble will be sorted in descending order. Find out which
origin/destination pair had the most flights.
Sometimes we would like to
compute new things from the variables present in the dataset. This can
be accomplished with the mutate()
verb that adds new
columns to the tibble containing the new computed quantities. Add two
new columns to the tibble, one that computes the time gained in the
flight and another that computes the speed of the plane in miles per
hour.
The columns can be placed at a
particular position using the .before
or
.after
arguments. One can also specify a column name
instead of a position. Place the new columns before the 8th column.
Place the new columns after the distance column.
Sometimes datasets may have
hundreds or thousands of variables, making it a challenge even to
perform initial exploration. In such situations, the
select()
function can subset select
columns.
Select just the
origin
and dep_delay
columns.
Select a range of columns from
year
to day
.
Invert the above selection.
Select only the columns with character data.
Rename the tailnum
column to tail_num
.
Relocate the
time_hour
column to the front. Move the columns containing
flight time information, from air_time
to
time_hour
to the position before
carrier
.
As discussed in the introduction
to dplyr
at the start, one can create groups using the
group_by()
verb. This allows the computation of statistical
summaries of the observations belonging to each group. For example,
let’s say we wanted to see if there was a particular month on the year
that was particularly bad for light delays, one could group by month and
computer the mean departure and arrival delays. Group the obervation by
month and computer the mean departure and arrival delays in each month.
Make sure to use the na.rm
argument to the
mean()
function to remove NA
from the
calculation.
It is possible to add multiple
statistical summaries. For example, we may wonder whether these
differences are real or due to random sampling. We could get an idea by
determining how many observations each group has; the greater the
observations, the less likely that the differences are due to sampling.
Use the n()
function to add a column with the number of
observations in each group. We could also determine the standard
deviations to get an idea whether the means are very different or not.
Add a third column for standard deviations. Do these differences appear
to be significant?
When making groups, it is useful
to inspect some of the observations from each group to understand
differences between them. For this purpose, dplyr
provides
the slice_head(n = <numrows>)
,
slice_tail(n = <numrows>)
, and
slice_sample(n = <numrows>)
functions which return
the first, last, and a random sample of <numrows>
respectively. slice_min(x, n = <numrows>)
and
slice_max(x, n = <numrows>) return the rows with the smallest or largest
values of column
x`
respectively. Inspect the first two rows and two random rows of each
month in the dataset.
Providing multiple variables to
the group_by()
command leads to the creation of groups
corresponding to all the different combinations of the values of the
three variables. The resulting data frame changes the groups to exclude
the last variable. Use .groups = "keep"
to preserve the
groups or .groups = "drop"
to remove all grouping. Group
the year
, month
, and day
variables, determine the average departure delays and the number of
flights on each day and find which day had the highest
delay.
Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.
Was there a flight on every day of 2013?
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related? Check whether this is true. Are there any discrepancies? Why?
There are helper functions
starts_with()
, ends_with()
and
contains()
that take string arguments and return the
columns that start with, end with, or contain the string. Use one of
these functions to select both the dep_time
and
dep_delay
columns without explicitly specifying the column
names.
Why doesn’t the following work, and what does the error mean?
Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
Which carrier has the worst average delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about small carriers who might not fly to very many airports. Determine how many airports each carrier flies from.)
Find the flights that are most delayed upon departure from each origin.
Find the flights that are most delayed on average upon departure from each origin. Can we conclude that one is more likely to encounter a delay on these flights?
How do departure delays vary over the course of the day? Illustrate your answer with a plot.
End of Module 3 Study Guide A