Please run these code chunks before you start.
library(vembedr)
library(kableExtra)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ lubridate::hms() masks vembedr::hms()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
library(nycflights13)
library(tidycovid19)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
If you are returning to this HW and had previously saved the workspace to a file, you can load the workspace by running this code chunk.
load(file = "covid.RData")
If you are doing this for the first time, run the code chunk below to download the merged COVID19 data.
covid <- download_merged_data(cached = TRUE, silent = TRUE)
Save the workspace to a file so that
you don’t have to keep downloading the dataset each time you return to
the HW. Make sure to have saved the HW R notebook in the
Submissions/Module3
directory before doing this so that the
file is saved there. You can do this each time you stop working if there
are intermediate data that take a long time to produce.
save.image(file = "covid.RData")
Run the code chunk below to see the data sources.
data(tidycovid19_data_sources)
df <- tidycovid19_data_sources |> select(-id)
df$description[nrow(df)] <- paste(
"The merged dataset provided by the tidycovid19 R package. Contains data",
"from all sources mentioned above."
)
kable(df) |> kableExtra::kable_styling()
function_name | description | url | last_data |
---|---|---|---|
download_jhu_csse_covid19_data() | The COVID-19 Data Repository by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) relies upon publicly available data from multiple sources that do not always agree. It is updated daily. The data comes in three data frames that you can select by the ‘type’ parameter. The ‘country’ data frame contains the global country-level data reported by JHU CSSE by aggregating over the regional data for countries that have regional data available. The ‘country_region’ data frame provides regional data for the countries that have regional data available (mostly Australia, Canada and China). The ‘us_county’ data frame reports the data for the U.S. at the county level. Please note: JHU stopped updating the data on March 10, 2023. | https://github.com/CSSEGISandData/COVID-19 | 2023-03-09 |
download_ecdc_covid19_data() | Country-level weekly data on new cases and deaths provided by the European Centre for Disease Prevention and Control (ECDC). The data was updated daily until 2020-12-14 and contains the latest available public data on the number of new Covid-19 cases reported per week and per country. | https://www.ecdc.europa.eu/en/covid-19/data | 2023-11-27 |
download_owid_data() | The Our World in Data team systematically collects data on Covid-19 testing, hospitalizations, and vaccinations from multiple national sources. Data points are collected with varying frequency across countries. The definition of what constitutes a ‘test’ varies, reflected by the variable ‘tests_units’ in the data frame. The vaccination data is currently only available based on ad hoc disclosures by a small set of countries. | https://github.com/owid/covid-19-data/tree/master/public/data | 2024-01-27 |
download_wbank_data() | The data frame reports current country-level statistics from the World Bank. The regional and income level classifications are also provided by the World Bank. ‘life_expectancy’ is measured in years at birth and ‘gdp_capita’ is measured in 2010 US-$. The original World Bank data items are (in the order in which they are represented in the data frame) ‘SP.POP.TOTL’, ‘AG.LND.TOTL.K2’, ‘EN.POP.DNST’, ‘EN.URB.LCTY’, ‘SP.DYN.LE00.IN’, ‘NY.GDP.PCAP.KD’. When you set the parameter ‘var_def’ to ‘TRUE’, the data comes in a list containing two data frames. The first contains the actual data, the second contains variable definitions. | https://data.worldbank.org | 2024-01-28 |
download_acaps_npi_data() | The #COVID19 Government Measures Dataset is provided by ACAPS. It puts together measures implemented by governments worldwide in response to the Coronavirus pandemic. Data collection includes secondary data review. The data is reported in event structure with an event reflecting a government measure. Measures are characterized as being either imposing/extending measures or lifting them and categorized in five categories with each category being split up in further sub-categories. Please note: ACAPS stopped updating the data on December 10, 2020 | https://www.acaps.org/covid19-government-measures-dataset | 2021-01-04 |
download_oxford_npi_data() | The data on the Oxford Coronavirus Government Response Tracker (OxCGRT) on non-pharmaceutical interventions comes in two data frames that you can select by setting the ‘type’ parameter. The ‘measures’ data frame reports data on governmental response measures as reported by the Oxford OxCGRT team. It is tidied by arranging its content by measure. All original country-day observations that are either initial or represent a value (not note) change from the previous day are included. Economic measures (E1-E4) are not included. The ‘index’ data frame reports the ‘Stringency Index’ and the ‘Legacy Stringency Index’ as calculated by the OxCGRT team based on their governance response measures in a country-day structure. Please note: As indicated on the homepage of the project, to a large extent the data is no longer updated after December 31, 2022 while data review processes continue. | https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker | 2022-12-31 |
download_apple_mtr_data() | Apple’s Mobility Trend Reports reflect requests for directions in Apple Maps. The data frame is organized by country-day and its data are expressed as percentages relative to a baseline volume on January 13th, 2020. The data comes in three data frames that you can select by the ‘type’ parameter. The ‘country’ data frame contains country-day level data. The ‘country_region’ data frame provides regional data for regions for which Apple reports regional data. The ‘country_city’ data frame reports city-level data for cities for which Apple reports this data. Please note: Apple stopped providing this data on April 14, 2022 | https://www.apple.com/covid19/mobility | 2022-04-12 |
download_google_cmr_data() | Google’s Community Mobility Reports chart movement trends over time across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. They show how visits and length of stays at different places change in percentages compared to a baseline (the median value, for the corresponding day of the week, during the 5-week period Jan 3 to Feb 6, 2020). The data comes in three data frames that you can select by the ‘type’ parameter. The ‘country’ data frame contains country-day level data. The ‘country_region’ data frame provides regional data for the countries for which Google reports regional data. The ‘us_county’ data frame reports daily data for the U.S. at the county level. Please note: Google stopped providing that data on October 15, 2022 | https://www.google.com/covid19/mobility/ | 2022-10-15 |
download_google_trends_data() | Data are Google Search Volume (GSV) measures as provided by Google Trends API, with the default search term ‘coronavirus’. The data comes in four data frames that you can select by the ‘type’ parameter and the sample period comprises Jan 1, 2020 up to date. The ‘country’ data frame lists GSV by country, to assess which country on average uses the search term most often over the sample period. The ‘country-day’ data frame reports daily search volume data for all countries that show up in the ‘country’ data frame. Each value is relative within country, meaning that values across countries cannot be compared directly. The ‘region’ and ‘city’ data frames list the relative GSV across regions and city within countries when provided by Google Trends. Keep in mind that within each data frame GSV are relative measures with a maximum of 100 indicating the highest search volume. This implies that GSV measures are not comparable across data frames. | https://trends.google.com/ | 2024-01-21 |
download_merged_data() | The merged dataset provided by the tidycovid19 R package. Contains data from all sources mentioned above. | https://github.com/joachim-gassen/tidycovid19 | 2024-01-28 |
Print out the variable descriptions by running the code chunk below.
data(tidycovid19_variable_definitions)
tidycovid19_variable_definitions |>
  select(var_name, var_def) |>
  kable() |>
  kableExtra::kable_styling()
var_name | var_def |
---|---|
iso3c | ISO3c country code as defined by ISO 3166-1 alpha-3 |
country | Country name |
date | Calendar date |
confirmed | Confirmed Covid-19 cases as reported by JHU CSSE (accumulated) |
deaths | Covid-19-related deaths as reported by JHU CSSE (accumulated) |
recovered | Covid-19 recoveries as reported by JHU CSSE (accumulated) |
ecdc_cases | Covid-19 cases as reported by ECDC (accumulated, weekly post 2020-12-14) |
ecdc_deaths | Covid-19-related deaths as reported by ECDC (accumulated, weekly post 2020-12-14) |
total_tests | Accumulated test counts as reported by Our World in Data |
tests_units | Definition of what constitutes a ‘test’ |
positive_rate | The share of COVID-19 tests that are positive, given as a rolling 7-day average |
hosp_patients | Number of COVID-19 patients in hospital on a given day |
icu_patients | Number of COVID-19 patients in intensive care units (ICUs) on a given day |
total_vaccinations | Total number of COVID-19 vaccination doses administered |
soc_dist | Number of social distancing measures reported up to date by ACAPS, net of lifted restrictions |
mov_rest | Number of movement restrictions reported up to date by ACAPS, net of lifted restrictions |
pub_health | Number of public health measures reported up to date by ACAPS, net of lifted restrictions |
gov_soc_econ | Number of social and economic measures reported up to date by ACAPS, net of lifted restrictions |
lockdown | Number of lockdown measures reported up to date by ACAPS, net of lifted restrictions |
oxcgrt_stringency_index | Stringency index as provided by the Oxford COVID-19 Government Response Tracker |
oxcgrt_stringency_legacy_index | Legacy stringency index based on old data format (prior April 25, 2020) as provided by the Oxford COVID-19 Government Response Tracker |
oxcgrt_government_response_index | Overall government response index as provided by the Oxford COVID-19 Government Response Tracker |
oxcgrt_containment_health_index | Containment and health index as provided by the Oxford COVID-19 Government Response Tracker |
apple_mtr_driving | Apple Maps usage for driving directions, as percentage*100 relative to the baseline of Jan 13, 2020 |
apple_mtr_walking | Apple Maps usage for walking directions, as percentage*100 relative to the baseline of Jan 13, 2020 |
apple_mtr_transit | Apple Maps usage for public transit directions, as percentage*100 relative to the baseline of Jan 13, 2020 |
gcmr_retail_recreation | Google Community Mobility Reports data for the frequency that people visit retail and recreation places expressed as a percentage*100 change relative to the baseline period Jan 3 - Feb 6, 2020 |
gcmr_grocery_pharmacy | Google Community Mobility Reports data for the frequency that people visit grocery stores and pharmacies expressed as a percentage*100 change relative to the baseline period Jan 3 - Feb 6, 2020 |
gcmr_parks | Google Community Mobility Reports data for the frequency that people visit parks expressed as a percentage*100 change relative to the baseline period Jan 3 - Feb 6, 2020 |
gcmr_transit_stations | Google Community Mobility Reports data for the frequency that people visit transit stations expressed as a percentage*100 change relative to the baseline period Jan 3 - Feb 6, 2020 |
gcmr_workplaces | Google Community Mobility Reports data for the frequency that people visit workplaces expressed as a percentage*100 change relative to the baseline period Jan 3 - Feb 6, 2020 |
gcmr_residential | Google Community Mobility Reports data for the frequency that people visit residential places expressed as a percentage*100 change relative to the baseline period Jan 3 - Feb 6, 2020 |
gtrends_score | Google search volume for the term ‘coronavirus’, relative across time with the country maximum scaled to 100 |
gtrends_country_score | Country-level Google search volume for the term ‘coronavirus’ over a period starting Jan 1, 2020, relative across countries with the country having the highest search volume scaled to 100 (time-stable) |
region | Country region as classified by the World Bank (time-stable) |
income | Country income group as classified by the World Bank (time-stable) |
population | Country population as reported by the World Bank (original identifier ‘SP.POP.TOTL’, time-stable) |
land_area_skm | Country land mass in square kilometers as reported by the World Bank (original identifier ‘AG.LND.TOTL.K2’, time-stable) |
pop_density | Country population density as reported by the World Bank (original identifier ‘EN.POP.DNST’, time-stable) |
pop_largest_city | Population in the largest metropolitan area of the country as reported by the World Bank (original identifier ‘EN.URB.LCTY’, time-stable) |
life_expectancy | Average life expectancy at birth of country citizens in years as reported by the World Bank (original identifier ‘SP.DYN.LE00.IN’, time-stable) |
gdp_capita | Country gross domestic product per capita, measured in 2010 US-$ as reported by the World Bank (original identifier ‘NY.GDP.PCAP.KD’, time-stable) |
timestamp | Date and time where data has been collected from authoritative sources |
Use dplyr
verbs or combinations thereof to explore this
dataset.
Let’s answer some questions about the time period this dataset covers. How many different days does the data cover?
What is the earliest and the last date in the dataset?
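For instance, a minimal sketch of one way to answer both questions with dplyr (assuming the merged data is in covid):
covid |>
  summarize(
    n_days = n_distinct(date),  # how many different days appear
    first_date = min(date),     # earliest date in the dataset
    last_date = max(date)       # latest date in the dataset
  )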
Let’s try and understand the different data sources. Read about the variables and identify the ones reported by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), European Centre for Disease Prevention and Control (ECDC), Our World in Data (OWID), ACAPS, and Oxford Coronavirus Government Response Tracker (OxCGRT). Present as an R markdown table.
What are the dates over which
each source provides information? Hint 1: If a data source does
not provide information on a particular day, it is recorded as
NA
. Hint 2: It might be useful to write a function
that takes the name of a column as a character string and determines the
earliest and latest dates it has valid data for. If doing this, make
sure to use get()
to get the actual column from the
character string.
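A possible skeleton for such a helper is sketched below; the function name date_range_for is only illustrative.
# Earliest and latest dates on which a given column has non-missing values
date_range_for <- function(col_name) {
  covid |>
    filter(!is.na(get(col_name))) |>
    summarize(
      variable = col_name,
      first_date = min(date),
      last_date = max(date)
    )
}
date_range_for("confirmed")  # example call for one JHU CSSE variable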
Let’s try and understand something about the countries reporting in the dataset. How many different countries are reporting in the dataset?
The countries are identified by
their name and the International Organization for Standardization (ISO) 3-letter
country code (iso3c
). Make a table of the letter code and
country names and organize alphabetically by country name so it is easy
to look up codes.
How many countries reported cases on the earliest date? How many countries reported cases on the last date? Make a plot of the number of countries reporting each day with date on the x-axis.
How many countries does each data source report? Hint: It might be useful to write a function that takes the name of a column as a character string and determines the number of countries.
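A sketch of such a helper, following the same get() pattern as above (the function name is illustrative):
# Number of distinct countries with at least one non-missing value in a column
countries_for <- function(col_name) {
  covid |>
    filter(!is.na(get(col_name))) |>
    summarize(variable = col_name, n_countries = n_distinct(iso3c))
}
countries_for("total_tests")  # example call for an OWID variable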
When did the USA start reporting and how many cases were reported on that day? When did the USA stop reporting and how many cases were reported on that day?
How many cases did the USA report on the last day in the JHU data?
To produce this plot so that we can
map both JHU and ECDC variables to the y axis, we need to merge both of
them into a single column. In order to distinguish between the two
sources, we have to add an additional factor/categorical column which
indicates which data source the value in the combined column is from.
This new reshaped data frame is called the “long” format since it has
many more rows than the original. This reshaping can be accomplished by
the pivot_longer()
function from the tidyr
package. The columns to be combined are given as a vector to the
cols
argument. The name of the additional column containing
the names of the columns (the factors/categories) is given by the
names_to
argument. The name of the merged column containing
the values/data is given by the values_to
argument. Run the
code chunk below to reshape and glimpse the tibble. How has the data
frame changed?
covid |>
  pivot_longer(cols = c(confirmed, ecdc_cases), names_to = "dataSource", values_to = "combCases") |>
  glimpse()
## Rows: 610,324
## Columns: 43
## $ iso3c <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "A…
## $ country <chr> "Aruba", "Aruba", "Aruba", "Aruba", "…
## $ date <date> 2020-03-13, 2020-03-13, 2020-03-14, …
## $ deaths <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ recovered <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ecdc_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_tests <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tests_units <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ positive_rate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hosp_patients <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ icu_patients <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ total_vaccinations <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ soc_dist <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mov_rest <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pub_health <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gov_soc_econ <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lockdown <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ oxcgrt_stringency_index <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1…
## $ oxcgrt_stringency_legacy_index <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1…
## $ oxcgrt_government_response_index <dbl> 2.08, 2.08, 2.08, 2.08, 2.08, 2.08, 8…
## $ oxcgrt_containment_health_index <dbl> 2.38, 2.38, 2.38, 2.38, 2.38, 2.38, 9…
## $ apple_mtr_driving <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ apple_mtr_walking <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ apple_mtr_transit <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gcmr_retail_recreation <dbl> -10, -10, -23, -23, -28, -28, -28, -2…
## $ gcmr_grocery_pharmacy <dbl> 40, 40, 15, 15, -13, -13, 1, 1, 8, 8,…
## $ gcmr_parks <dbl> -4, -4, -7, -7, -6, -6, -5, -5, -18, …
## $ gcmr_transit_stations <dbl> -5, -5, -19, -19, -18, -18, -18, -18,…
## $ gcmr_workplaces <dbl> 3, 3, -3, -3, -5, -5, -21, -21, -29, …
## $ gcmr_residential <dbl> 1, 1, 7, 7, 6, 6, 12, 12, 15, 15, 32,…
## $ gtrends_score <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gtrends_country_score <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ region <chr> "Latin America & Caribbean ", "Latin …
## $ income <chr> "High income", "High income", "High i…
## $ population <dbl> 106445, 106445, 106445, 106445, 10644…
## $ land_area_skm <dbl> 180, 180, 180, 180, 180, 180, 180, 18…
## $ pop_density <dbl> 591.8722, 591.8722, 591.8722, 591.872…
## $ pop_largest_city <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ life_expectancy <dbl> 74.626, 74.626, 74.626, 74.626, 74.62…
## $ gdp_capita <dbl> 32492.18, 32492.18, 32492.18, 32492.1…
## $ timestamp <dttm> 2024-01-28 21:12:00, 2024-01-28 21:1…
## $ dataSource <chr> "confirmed", "ecdc_cases", "confirmed…
## $ combCases <dbl> NA, 2, NA, 2, NA, 2, NA, 2, NA, 2, NA…
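The plotting function referred to in the next paragraph is left for you to write; one minimal sketch, building on the long format just created (the function name is illustrative):
plot_cumulative_cases <- function(countries = c("USA")) {
  covid |>
    filter(iso3c %in% countries) |>
    pivot_longer(cols = c(confirmed, ecdc_cases),
                 names_to = "dataSource", values_to = "combCases") |>
    filter(!is.na(combCases)) |>
    ggplot(aes(x = date, y = combCases, color = dataSource)) +
    geom_line() +
    facet_wrap(~country, scales = "free_y") +
    labs(x = "Date", y = "Cumulative cases")
}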
Once you have written the function and plotted the cases for a few countries, answer the following. Why is the plot only increasing? What is the difference between JHU and ECDC data? Does the pattern hold for other countries as well?
Given that the
confirmed
, deaths
, and recovered
in both data sources are cumulative, we should compute the daily new
cases, deaths, and recovered each day. The lag(x, n)
function returns the vector x
“lagged” by n
positions. For example, if n
is 1, then the output will
contain the previous element of x
in each position. If
n
is 2 then the output will have the element from two
positions behind, and so on. Test out the function by using
seq()
to create a vector x
and then using
lag()
with different values of n
. Note what
happens to the first n
values. Why?
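For example, a quick experiment along these lines:
x <- seq(10)
lag(x, 1)  # each position holds the previous element; the first value is NA
lag(x, 2)  # each position holds the element two back; the first two values are NA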
Use dplyr
verbs and
the lag()
function to add new columns to the
covid
tibble for daily cases, daily deaths, daily
recovered, daily tests, and daily vaccinations. Assign to a new
variable. Keep in mind that you will have to run this code chunk each
time you start a new session or the code you write below may not
work.
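One possible sketch, assuming the new tibble is named covid_daily (the name and new column names are illustrative); grouping by country keeps each lag within that country's own time series:
covid_daily <- covid |>
  group_by(iso3c) |>
  arrange(date, .by_group = TRUE) |>
  mutate(
    daily_cases        = confirmed - lag(confirmed),
    daily_deaths       = deaths - lag(deaths),
    daily_recovered    = recovered - lag(recovered),
    daily_tests        = total_tests - lag(total_tests),
    daily_vaccinations = total_vaccinations - lag(total_vaccinations)
  ) |>
  ungroup()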
Write a function to plot new daily cases in time. The function
should take a tibble (default: the one you created in the previous
question) and a character vector of country codes (c("USA")
default) as arguments and display the curves of different countries
either in different colors or different facets. Make sure to filter out
NA
s. Which visualization is better, colors or facets? Look
at the progression in the USA, India, and the UK. Can you identify the
delta and omicron waves?
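A minimal sketch of such a function using facets, assuming the illustrative covid_daily tibble and column names from above:
plot_daily_cases <- function(data = covid_daily, countries = c("USA")) {
  data |>
    filter(iso3c %in% countries, !is.na(daily_cases)) |>
    ggplot(aes(x = date, y = daily_cases)) +
    geom_line() +
    facet_wrap(~country, scales = "free_y") +
    labs(x = "Date", y = "Daily new cases")
}
plot_daily_cases(countries = c("USA", "IND", "GBR"))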
Another way to look at new cases
is their probability distribution. Write a function to plot the
distribution of new cases given a character vector of country codes.
What kind of probability distribution does it resemble? Is it unimodal
or not? Given the broad range of \(x\)
values, it might be useful to plot it on the log scale. This can be
accomplished by adding a ggplot
layer
scale_x_log10()
(try both linear and log
axes).
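A sketch of such a distribution plot, again assuming the illustrative covid_daily tibble (positive values only, so the log scale is well defined):
plot_case_distribution <- function(data = covid_daily, countries = c("USA")) {
  data |>
    filter(iso3c %in% countries, !is.na(daily_cases), daily_cases > 0) |>
    ggplot(aes(x = daily_cases)) +
    geom_histogram(bins = 50) +
    scale_x_log10() +  # remove this layer to see the linear-axis version
    facet_wrap(~country) +
    labs(x = "Daily new cases", y = "Count")
}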
Generalize the time series plotting function to take a column name as an argument so that you can use it to plot things other than cases. Use it to plot the time course of daily testing. Hint: It may help to add the axis labels based on the column name argument.
Generalize the distribution plotting function to take a column name as an argument so that you can use it to plot things other than cases. Use it to plot the probability distribution of daily tests. Hint: It may help to add the axis labels based on the column name argument.
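One way to generalize both functions is to pass the column name as a character string and look it up with the .data pronoun (or get()); a sketch for the time series version:
plot_daily_column <- function(data = covid_daily, countries = c("USA"),
                              col_name = "daily_cases") {
  data |>
    filter(iso3c %in% countries, !is.na(.data[[col_name]])) |>
    ggplot(aes(x = date, y = .data[[col_name]])) +
    geom_line() +
    facet_wrap(~country, scales = "free_y") +
    labs(x = "Date", y = col_name)  # axis label taken from the column name argument
}
plot_daily_column(col_name = "daily_tests")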
Plot the time course of hospitalizations and ICU patients.
Plot the probability distributions of hospitalizations and ICU patients.
Let’s compare total deaths at the end of the JHU CSSE reporting period. Based on the last date of the JHU CSSE dataset (the answer to question #5 above) extract all the countries and sort in descending order of deaths. Print out the country code, country name, and deaths. Is this a good way to compare countries?
Make a plot of the total deaths vs. country population, and use color to indicate geographical region. Why are most points at the bottom left? Change the plot in such a way that the points spread out. What can be concluded from this plot?
Compute the per capita death rate (in deaths per 100,000 people) and print out the countries with the 10 highest and 10 lowest rates along with their populations, total deaths, and per capita death rate. Compute the per capita death rate of the entire world from this data. Given that this death rate is over three years, how does the COVID death rate compare with mortality from other factors in the USA?
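A sketch of the per capita computation; the date 2023-03-09 is taken from the data sources table above and should be replaced with your answer to question #5:
death_rates <- covid |>
  filter(date == ymd("2023-03-09"), !is.na(deaths), !is.na(population)) |>
  mutate(deaths_per_100k = deaths / population * 1e5) |>
  select(iso3c, country, population, deaths, deaths_per_100k)
death_rates |> arrange(desc(deaths_per_100k)) |> head(10)  # 10 highest rates
death_rates |> arrange(deaths_per_100k) |> head(10)        # 10 lowest rates
death_rates |> summarize(world_per_100k = sum(deaths) / sum(population) * 1e5)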
Try to get some insight into whether there are differences in the per capita death rate between regions. Experiment with at least two ways of displaying per capita death rates with geographic region. What may be concluded? Discuss confounding factors affecting this analysis.
The date column of the dataset is of the special data type Date for representing dates.
class(covid$date)
## [1] "Date"
The package lubridate
(?lubridate-package) provides several
functions for working with dates. One common task is to extract the month, year,
or day from a variable of type
Date
. These functions are
year()
, month()
, and day()
. For
example
covid$date[100200]
## [1] "2022-04-20"
year(covid$date[100200])
## [1] 2022
month(covid$date[100200])
## [1] 4
day(covid$date[100200])
## [1] 20
We could use these functions to add
year
and month
columns to the dataset to be
able to group according to year-month combinations. However, after
computing the average, we would also have to add back a date column so
that we can plot the CFR (case fatality rate) in time. This can be accomplished in two steps.
The first is to construct a character string representation of the date
from the year
and month
. This can be done with
the paste(..., sep = " ")
function. paste()
takes a set of character strings (...
) and joins them
together using the separator (sep
) and returns the combined
string. For example,
paste("Today", "is", "a", "sunny", "day")
## [1] "Today is a sunny day"
paste("Today", "is", "a", "sunny", "day", sep = "-")
## [1] "Today-is-a-sunny-day"
dateStr <- paste(2020, 2, 1, sep = "-")
dateStr
## [1] "2020-2-1"
In the second step, the character
string representation of the date can be converted to the
Date
type with the ymd()
function. Note that
ymd()
assumes a “YYYY-MM-DD” format.
ymd(dateStr)
## [1] "2020-02-01"
class(ymd(dateStr))
## [1] "Date"
Use these functions and
dplyr
to compute the monthly CFR. Make sure to exclude
NA
s from the daily cases and daily deaths columns. Make
sure to add back a date column so that your function to plot country
data in time (problem 15 above) can be
used. Plot the monthly CFR for comparable countries (for example, USA,
Canada, and Germany) in time. Explain the trends. Why does the CFR
increase towards the end of 2023? Justify your answer with
data.
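A possible sketch of the monthly aggregation, assuming the illustrative covid_daily tibble with daily_cases and daily_deaths columns:
monthly_cfr <- covid_daily |>
  filter(!is.na(daily_cases), !is.na(daily_deaths)) |>
  mutate(year = year(date), month = month(date)) |>
  group_by(iso3c, country, year, month) |>
  summarize(
    cfr = sum(daily_deaths) / sum(daily_cases),  # monthly case fatality rate
    .groups = "drop"
  ) |>
  mutate(date = ymd(paste(year, month, 1, sep = "-")))  # date column added back for plotting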
Let us directly examine the
relationship between CFR and testing. If there isn’t sufficient testing,
then the number of cases is underestimated, leading to an overestimate
of the CFR. Instead of computing the monthly CFR as done above, compute
the annual CFR. Make sure to exclude missing data (NA
s) for
cases and deaths and also non-reporting of tests (daily tests should be
more than 0). In addition to the annual CFR, compute the testing rate,
the ratio of the annual tests divided by the population of the country.
Keep in mind that summarize only keeps the columns corresponding to the
groups and the summary variables you compute. Therefore the population
of the country would have to be included as a summary variable when
summarizing in order to be able to compute the testing rate. Keep only
the data from 2020 and 2021. Assign this tibble to a new variable. Then,
plot the annual CFR vs. the testing rate, showing 2020 and 2021 in
separate colors. Add a nonlinear trend line (see Module 1 Study Guide B
or Section 1.2.4 of R
for Data Science (2e)). What may be concluded about the CFR and
testing rate? What level of CFR is expected in countries where testing
is widely available?
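A sketch of one way to build the annual summary and plot, assuming the illustrative covid_daily tibble with a daily_tests column:
annual_cfr <- covid_daily |>
  filter(!is.na(daily_cases), !is.na(daily_deaths), daily_tests > 0) |>
  mutate(year = year(date)) |>
  filter(year %in% c(2020, 2021)) |>
  group_by(iso3c, country, year) |>
  summarize(
    cfr          = sum(daily_deaths) / sum(daily_cases),
    population   = first(population),              # kept so the rate can be computed
    testing_rate = sum(daily_tests) / population,
    .groups = "drop"
  )
annual_cfr |>
  ggplot(aes(x = testing_rate, y = cfr, color = factor(year))) +
  geom_point() +
  geom_smooth(se = FALSE)  # nonlinear trend line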
What factors might affect the CFR? Sort the data frame in ascending order of CFR and also descending order of testing rate. What are the lowest CFRs? What is notable about the countries with the lowest CFRs?
The rollmean(x, k, fill)
function from the zoo
package allows us to slide a window of width k
starting at
each element of vector x
and compute the mean over that
window. The fill
parameter decides what should be filled at
the start and end of the vector when k
elements are not
available. Setting it to NA
fills with NA
s
which will later be ignored. Let’s try it out. Run the code chunk with
different values of k
to see what happens.
x <- seq(100)
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
rollmean(x, 2, fill = NA)
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
## [16] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5 30.5
## [31] 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5 44.5 45.5
## [46] 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5 58.5 59.5 60.5
## [61] 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5 72.5 73.5 74.5 75.5
## [76] 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5 86.5 87.5 88.5 89.5 90.5
## [91] 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5 99.5 NA
Write a function that takes a character
vector of countries and returns a plot of per capita
daily cases, deaths, and vaccinations in time for the given countries. In
order to smooth the noisiness, add new columns with rolling means for
daily cases, deaths, and vaccinations over a window of your choosing.
Fill with NA
s. Make sure to filter out NA
s for
cases and deaths. In order to be able to plot all three variables, you
will have to change the tibble to the “long” form with
pivot_longer()
as we did in problem 12 above. The problem you might face is that the
cases, vaccinations, and deaths are on very different scales. Daily
deaths are in the thousands, daily cases are in the hundreds of
thousands, and daily vaccinations are in the millions. This may cause
the deaths to be “squished down” into a flat line. If that happens,
divide the rolling mean of cases and vaccinations by appropriate factors
to match the scale of deaths. Compare the plots of USA and GBR and
interpret.
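A condensed sketch of the kind of function described above; the window width, scaling factors, and names are illustrative (this version rescales raw daily counts rather than computing per capita values), and the covid_daily tibble from earlier is assumed:
plot_cases_deaths_vax <- function(countries = c("USA"), k = 7) {
  covid_daily |>
    filter(iso3c %in% countries, !is.na(daily_cases), !is.na(daily_deaths)) |>
    group_by(iso3c) |>
    arrange(date, .by_group = TRUE) |>
    mutate(
      cases_roll  = rollmean(daily_cases, k, fill = NA) / 100,          # rescaled toward deaths
      deaths_roll = rollmean(daily_deaths, k, fill = NA),
      vax_roll    = rollmean(daily_vaccinations, k, fill = NA) / 1000   # rescaled toward deaths
    ) |>
    ungroup() |>
    pivot_longer(cols = c(cases_roll, deaths_roll, vax_roll),
                 names_to = "series", values_to = "value") |>
    ggplot(aes(x = date, y = value, color = series)) +
    geom_line() +
    facet_wrap(~country) +
    labs(x = "Date", y = "Rolling mean (rescaled)")
}
plot_cases_deaths_vax(c("USA", "GBR"))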
End of Module 3 HW A