Please run this code chunk before you start.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::hms() masks vembedr::hms()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(palmerpenguins)
Read Chapter 25 of R for Data Science (2e).
Carry out the functions activities and exercises.
Functions are one of the most important tools in coding. Defining a function for some task allows us to perform the task in one line of code even if the task takes thousands of lines of code to complete. For example, recall the code we wrote in Module 1 Study Guide B to compute the number of molecules of DNA given its weight in grams \(g\) and length \(l\).
g <- 6E-8
l <- 1000
w <- 650*l
m <- g/w
n <- m * 6.023E23
n
## [1] 55596923077
Part of the same task, converting grams of DNA to moles, could be completed by first defining a function, let’s call it convertGramsToMoles, and then invoking it or calling it.
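convertGramsToMoles <- function(grams, len)
{
  # molecular weight of the DNA, using 650 per unit of length as in the code above
  molWeight <- 650*len
  # moles = mass / molecular weight
  moles <- grams/molWeight
  return(moles)
}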
dnaMoles <- convertGramsToMoles(grams = 6E-11, len = 1000)
dnaMoles
## [1] 9.230769e-17
In the definition above:
The inputs are listed in parentheses after the keyword function. The inputs are called the arguments of the function and allow the function to be flexible and able to complete a range of tasks instead of a rigid, fixed task.
The body of the function, the code that carries out the task, is enclosed between { and }.
The output is specified with return(), here the last statement of the function. The function is said to return the output.
The function is invoked or called with its name followed by the arguments enclosed in parentheses, and the output returned by the function can be stored in a variable with the assignment operator.
Functions allow you to reuse your code and offer a number of benefits. Packages such as ggplot2 and dplyr are just collections or libraries of functions.
Inspect the Environment tab and note that the function has been added to the environment. This means that if we would like to repeat the calculation with a different mass or length, instead of copying the code, we can just call the function again with different inputs. Note that once we have entered the first few letters, RStudio automatically offers to autofill the name, which can be accomplished with the tab key.
Compute the number of human genomes in 60pg of DNA by calling the convertGramsToMoles() function with appropriate values of the arguments.
One may specify default values of the arguments in the function definition using the = operator. This has the benefit of simplifying function calls when some arguments are rarely changed. Rewrite the convertGramsToMoles() function to have the len argument default to the length of the human genome and then call it with just one argument.
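As a minimal sketch of default arguments, consider the hypothetical function raiseToPower() below (it is not part of the study guide and was chosen only for illustration). The exponent argument is given a default with = in the definition, so it can be left out of the call.

```r
# Hypothetical example: raise x to a power, squaring by default.
raiseToPower <- function(x, exponent = 2)
{
  return(x^exponent)
}

squared <- raiseToPower(3)               # exponent falls back to its default, 2
cubed <- raiseToPower(3, exponent = 3)   # the default is overridden in the call
squared
cubed
```

Arguments without a default, such as x here, must still be supplied in every call.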
It is possible to call functions from within functions. This allows us to divide and conquer: to break down a complex task into several simpler ones and then combine them using function calls. This also makes code more reusable. Write a function to convert grams of DNA to the number of molecules that reuses the convertGramsToMoles() function.
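The idea of one function calling others can be sketched with temperature conversions. All three function names below are hypothetical and unrelated to the DNA example; the point is only that fahrenheitToKelvin() reuses the two simpler helpers instead of repeating their arithmetic.

```r
# Hypothetical helpers, not part of the study guide.
fahrenheitToCelsius <- function(fahrenheit)
{
  return((fahrenheit - 32) * 5 / 9)
}

celsiusToKelvin <- function(celsius)
{
  return(celsius + 273.15)
}

# Combines the two helpers with function calls.
fahrenheitToKelvin <- function(fahrenheit)
{
  celsius <- fahrenheitToCelsius(fahrenheit)
  return(celsiusToKelvin(celsius))
}

boilingK <- fahrenheitToKelvin(212)
boilingK
```

If the conversion formula in one helper ever needs correcting, it only has to be fixed in one place.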
The arguments and outputs can be any data structure or data type. Write a function that takes the flights tibble and an airline code as arguments and returns the longest departure delay suffered by the airline in the entire year.
Modify the function you wrote above to return both the origin airport code and the departure delay. Use a list (?list).
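A function returns a single object, so several values are usually bundled into a list. The sketch below uses a hypothetical function extremes() on a made-up vector; the element names smallest and largest are illustrative choices.

```r
# Hypothetical example: return two values at once by wrapping them in a named list.
extremes <- function(x)
{
  return(list(smallest = min(x), largest = max(x)))
}

result <- extremes(c(4, 9, 1, 7))
result$smallest   # elements of the list are accessed by name with $
result$largest
```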
You can even return plots (plot objects). Write a function that takes an airport code and returns a plot object plotting the average number of flights departing in each hour during the day. Use require() to silently load the nycflights13 library inside the function instead of passing the flights tibble as an argument.
The greater the range of input data and tasks that our code works with, the more useful it is. One of the ways of accomplishing this is to write code that changes its behavior based on the context, that is, the values of variables. Changing which lines of code are run or how many times a task is repeated is known as flow control. There are two main ways of controlling the flow of code. The first is to make choices or decisions while the second is to perform repetitive tasks.
if-else and switch
One can make choices by using the if-else conditional statements.
if (condition) true_action
condition is a Boolean variable. If the Boolean variable has value TRUE then true_action statements are run. If the condition is FALSE, then true_action statements are skipped and execution proceeds to the statement after the if. Another way of writing it is
if (condition) true_action else false_action
in which false_action statements are run if condition is FALSE.
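The two forms can be sketched with a made-up temperature value; the variable names temperature and description are assumptions chosen for illustration.

```r
temperature <- 12   # made-up value for illustration

# The condition is FALSE, so the action is skipped and nothing is printed.
if (temperature < 0) print("Below freezing")

# With else, exactly one of the two actions runs.
if (temperature > 10) description <- "mild" else description <- "cold"
description
```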
Generate two random numbers using the rnorm(n, mean, sd) function and assign them to two variables. Test whether the first one is greater than the second one and assign the result to a variable. Print out this variable. Use if to print out “Greater” if the first number is greater. After the if write a statement to print “Hello”. Run the code repeatedly.
Change the code above to print out “Lesser than or equal to” if the first number is lesser than or equal to the second one. Run the code repeatedly.
It is possible to execute multiple statements if the condition is TRUE or FALSE by enclosing them in { and }. Multiple statements enclosed within {} are known as compound statements. In addition to printing out “Greater” or “Lesser than or equal to”, print out the absolute value of the difference between the two numbers.
if returns a value that can be assigned to a variable. Also, one may directly write a statement that returns a Boolean value in the condition. Generate a random number using rnorm() and write a statement to determine its absolute value.
Recursion. Now that we know how to write conditional statements, we can show the cool feature of recursion. It is possible for a function to call itself. This is known as recursion and makes it possible to write very elegant code. Write a function that takes an integer as an argument and returns the factorial \(n! = n \cdot (n-1) \cdot (n-2) \cdots 3 \cdot 2 \cdot 1\).
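To illustrate the pattern on a different problem, here is a minimal sketch of a recursive function that counts the digits of a non-negative whole number. The name countDigits() is hypothetical. The if handles the base case that stops the recursion, and the else branch calls the function again on a smaller problem.

```r
# Hypothetical example: count the digits of a non-negative whole number.
countDigits <- function(n)
{
  if (n < 10) {
    return(1)                           # base case: a single digit
  } else {
    return(1 + countDigits(n %/% 10))   # drop the last digit and recurse
  }
}

countDigits(2024)
```

Without a base case the function would keep calling itself until R stops with an error, so every recursive function needs a condition under which it returns without recursing.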
It is also possible to test multiple conditions using else if. Write a function that takes a student’s grade in percentage and returns the letter grade.
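A chain of else if tests is checked from top to bottom and only the first branch whose condition is TRUE runs. The sketch below shows this with a hypothetical function describeSign(), chosen only for illustration.

```r
# Hypothetical example: classify a number with a chain of conditions.
describeSign <- function(x)
{
  if (x < 0) {
    result <- "negative"
  } else if (x == 0) {
    result <- "zero"
  } else {
    result <- "positive"   # reached only if both conditions above are FALSE
  }
  return(result)
}

describeSign(-3.5)
describeSign(0)
describeSign(8)
```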
When one among many alternatives needs to be picked based on the value of a character string, the switch statement may be used for a more concise syntax.
switch(<variable>,
<first possible value> = <first return value>,
<second possible value> = <second return value>,
<third possible value> = <third return value>,
.
.
.
stop(<error message string>)
)
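A concrete instance of this template might look as follows. The function metersPerUnit() and its unit names are hypothetical; the final unnamed argument plays the role of the error case and runs only when none of the named values match.

```r
# Hypothetical example: look up a conversion factor from a unit name.
metersPerUnit <- function(unit)
{
  switch(unit,
         km = 1000,
         m = 1,
         cm = 0.01,
         stop("Unknown unit: ", unit)   # default: no named value matched
  )
}

metersPerUnit("km")
metersPerUnit("cm")
```

Calling metersPerUnit("mile") stops with the error message instead of silently returning nothing.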
Write a function that takes the type of statistical summary as a character string (mean, standard deviation, median, or MAD (median absolute deviation)) and the name of a numeric variable in the penguins dataset. The function should compute the desired statistical summary for the given variable for each combination of species and island and return the tibble. Note: You cannot use the column name argument directly in dplyr verbs since it is a string and not the actual column variable. The get() function returns the actual column variable given the name of the column.
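The behavior of get() can be sketched in base R with the built-in mtcars data frame, which stands in for penguins so that the example needs no extra packages. The function columnMean() is hypothetical, and with() is used here in place of a dplyr verb to make the columns of the data frame visible to get().

```r
# Hypothetical example: compute the mean of a column whose name arrives as a string.
columnMean <- function(df, colName)
{
  # with() evaluates its second argument using the columns of df;
  # get() looks up the object whose name is stored in colName.
  values <- with(df, get(colName))
  return(mean(values))
}

columnMean(mtcars, "mpg")
```

In base R the same column could also be extracted with df[[colName]].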
There are many situations when some lines of code have to be run repeatedly. For example, we may have a vector of numbers and we need to compute the absolute value of each number. We could just repeat the code as many times as the length of the vector.
rVec <- rnorm(n = 3)
absR <- vector(mode = "numeric", length = 3)
absR[1] <- if (rVec[1] >= 0) rVec[1] else -1*rVec[1]
absR[2] <- if (rVec[2] >= 0) rVec[2] else -1*rVec[2]
absR[3] <- if (rVec[3] >= 0) rVec[3] else -1*rVec[3]
rVec
## [1] -0.1719711 -0.1363529 0.6289787
absR
## [1] 0.1719711 0.1363529 0.6289787
However, repeating the code in this way has several weaknesses. First, the code is not general: if the length of the vector changes, then we have to change the code. Second, the code is impractical since the vector may have millions of elements. Third, copying, pasting, and modifying indices is inefficient as we have to write a lot of code. Fourth, copying, pasting, and modifying indices is error prone. Fortunately, R (like most other programming languages) provides a way to execute one or more lines of code repeatedly until some condition is met. This is known as looping since the flow chart of the code has a loop in it.
One method is with the for statement which allows us to loop code a fixed number of times, once for each element of a vector. The second is with the while statement which allows us to loop code a variable number of times while some condition is true.
for (<variable> in <vector>) <expression>
<expression> could be a single or compound statement.
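As a minimal sketch of this syntax, the loop below converts a made-up vector of temperatures from Celsius to Fahrenheit one element at a time. The variable names are illustrative, and seq_along() is used to generate the indices 1, 2, ..., length(celsius).

```r
celsius <- c(-5, 0, 21.5, 37)   # made-up values for illustration

# Create the result vector first, then fill it in one element per iteration.
fahrenheit <- vector(mode = "numeric", length = length(celsius))
for (i in seq_along(celsius)) {
  fahrenheit[i] <- celsius[i] * 9 / 5 + 32
}
fahrenheit
```

The loop works unchanged for a vector of any length, which is what the copy-and-paste version could not do.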
Rewrite the code for computing the absolute values of the elements of a numeric vector using a for loop, assigning the results to a new vector.
Compute the sum of the absolute values of the elements of the vector by accumulating the values. This is a common technique, where a variable is initialized and its value is updated with each iteration, accumulating some quantity (sum, product, or some other operation).
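The accumulator pattern can be sketched with a product instead of a sum. The vector and the variable name runningProduct are made up for illustration; note that the accumulator starts at 1 for a product, whereas a sum would start at 0.

```r
values <- c(2, 3, 4)   # made-up values for illustration

runningProduct <- 1    # initialize before the loop
for (v in values) {
  runningProduct <- runningProduct * v   # update with each element
}
runningProduct
```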
Write a function that takes a penguin species name and island as arguments and returns how many penguins of the specified species live on the specified island.
Nested loops. Just as we have put if inside a for loop above, it is possible to put a for loop inside another for loop. This is known as nesting. In fact, like Russian dolls, it is possible to repeatedly nest loops inside loops. Write a function to count the number of penguins in each combination of species and island. Return a data frame with three columns: species, island, and n. Reuse the function defined above. Accomplish the same task using dplyr and compare the results. Hint 1: the unique() function takes a vector and returns a vector containing only the unique values, similar to the distinct() verb of dplyr, except the latter returns tibbles with unique combinations. Hint 2: Remember to initialize the data frame using data.frame() and vector().
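Nesting can be sketched on a smaller problem: filling a multiplication table. The names rowValues, colValues, and products are illustrative. For each value of the outer index i, the inner loop runs through every value of j, so the body executes once per combination.

```r
rowValues <- 1:3
colValues <- 1:4

# One cell per combination of a row value and a column value.
products <- matrix(0, nrow = length(rowValues), ncol = length(colValues))
for (i in seq_along(rowValues)) {
  for (j in seq_along(colValues)) {
    products[i, j] <- rowValues[i] * colValues[j]
  }
}
products
```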
Situations often arise in which it is not known in advance how many times the loop should be performed, and the number of times varies from situation to situation. For example, we could be simulating population growth with births and deaths and would like to stop the simulation when the population reaches equilibrium, but we don’t know beforehand how many generations it is going to take to reach equilibrium. For such situations, R (and other programming languages) provides the while flow control loop.
while (condition) expr
expr, which could be a compound statement, is executed while the Boolean variable or expression condition is TRUE.
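A minimal sketch of a while loop, using made-up numbers: a quantity is halved repeatedly until it falls below a threshold, and a counter records how many iterations that took. The variable names are illustrative.

```r
amount <- 100    # made-up starting value
threshold <- 1
steps <- 0       # counts how many times the loop body runs

while (amount >= threshold) {
  amount <- amount / 2
  steps <- steps + 1
}
steps
amount
```

The condition is checked before every iteration, so the body must change something the condition depends on; otherwise the loop never ends.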
Write a function that takes the initial population, n_0, growth rate, alpha, death rate, beta, and tolerance, tol, and computes the population in each successive generation until the population reaches equilibrium. Use the logistic model in which the change in population each generation is proportional to the population, \(n_{g+1} - n_g = r(n_g) \cdot n_g\), where \(g\) is the generation number, \(n_g\) is the population in generation \(g\), and \(r(n_g)\) is the rate of growth. As the population grows, the rate of growth reduces due to competition for limited resources so \(r(n_g) = \alpha - \beta \cdot n_g\). Equilibrium is when the change in population between successive generations is less than the tolerance. Use the while flow control statement to determine when to stop. The function should return a data frame with two numeric columns: generation and population. Plot the growth curve for \(n_0 = 10\), \(\alpha = 1\), \(\beta = 0.001\), and tolerance of 1. Change \(\alpha\), \(\beta\), or the tolerance to see how the number of iterations of the loop changes.
Write a function to take the number of grams of DNA and the length of the genome as inputs to determine how many cells need to be lysed to yield that amount. Make sure to reuse the function defined above. Set the default value of the genome size to the length of the human genome. Note that the number of cells is an integer. Look up the ceiling() and floor() functions to round the number. Use the function to compute how many cells are needed for 60ng of human DNA.
Write a function to take the species of penguin as input and return the body mass of the smallest penguin of that species. Look up the function to compute the minimum of a vector (?min). Use pipes.
Write a function to take a numeric vector as an input, scale its values to be between 0 and 1, and return the scaled values. This can be accomplished by subtracting the minimum from each element and dividing this difference by the range (the difference between the maximum and minimum). The functions min(), max(), or range() might be useful here. Recall that operations between vectors are carried out elementwise. The rnorm(n, mean = 0, sd = 1) function generates a double precision vector of length n containing normally distributed random numbers with mean mean and standard deviation sd. Use rnorm() to generate a vector of length 10, assign it to a variable, and then test your function.
Write a function to take a data frame and a column name (string) as inputs and add a new column called scaledCol of the scaled values to it. Make sure to reuse the scaling function you wrote previously. Return the data frame. Check whether your function works by scaling columns of any of the datasets you have worked with. Note: You cannot use the column name argument directly in dplyr verbs since it is a string and not the actual column variable. The get() function returns the actual column variable given the name of the column.
Write a function that takes an origin airport as an argument, creates a plot of the average departure delay and the number of flights at that origin for each airline, and returns the plot object. Your function should ignore/remove airlines with fewer than 1 flight per day. Use require() to load the nycflights13 library inside the function. Note: By default geom_bar() plots the counts of each category to make bar charts of counts. You can instruct geom_bar() to plot the values in the column using the stat = "identity" argument.
Write a function to take two numbers and print out “divisible” if the first number is divisible by the second or “not divisible” if not. Hint: the modulo operator (x %% y) returns the remainder when one number (x) is divided by another (y). Read more on the help page for arithmetic operators (?Arithmetic). Test your function.
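As a quick sketch of the operator mentioned in the hint, with made-up numbers: 17 divided by 5 is 3 with 2 left over. The variable names are illustrative, and %/% is the companion integer-division operator described on the same help page.

```r
remainder <- 17 %% 5    # what is left over after dividing 17 by 5
quotient <- 17 %/% 5    # how many whole times 5 goes into 17
remainder
quotient
```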
Write a function to plot the distribution of body mass of the penguins in the penguins dataset. The function should take a character string as an argument. If the value is “histogram”, plot the distribution as a histogram. If the value is “density”, then plot the distribution as a density. Use require() to silently load palmerpenguins and ggplot2 if necessary. Return the plot object.
The Heaviside function, \(H(x)\), is \(1\) when \(x \geq 0\) and \(0\) when \(x<0\). Write a function that takes a number as input and returns the value of the Heaviside function.
Write a function that takes an integer \(n\) as an argument and computes the sum of all the integers from \(1\) to \(n\), that is \(1 + 2 + \cdots + (n-1) + n\). Use recursion.
Write a function that takes a codon–a character string of length 3–as input and then uses the genetic code to return the IUPAC letter code of the amino acid encoded by the codon. Which flow control statement would work best here? Write the function for 4 different codons.