Please run this code chunk before you start.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ lubridate::hms() masks vembedr::hms()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(palmerpenguins)

1 Study materials

1.1 Functions

Read Chapter 25 of R for Data Science (2e).

Carry out the functions activities and exercises.


1.2 Flow control

Read Chapter 5 of Advanced R (2e).

1.2.1 Making choices with if-else and switch

1.2.2 Performing repetitive tasks by looping

Carry out the flow control activities and exercises.


2 Activities

2.1 Functions

Functions are one of the most important tools in coding. Defining a function for some task allows us to perform the task in one line of code even if the task takes thousands of lines of code to complete. For example, recall the code we wrote in Module 1 Study Guide B to compute the number of molecules of DNA given its weight in grams \(g\) and length \(l\).

g <- 6E-8
l <- 1000
w <- 650*l
m <- g/w
n <- m * 6.023E23
n
## [1] 55596923077

The first step of this task, converting grams to moles, could be completed by first defining a function, let’s call it convertGramsToMoles, and then invoking it or calling it.

convertGramsToMoles <- function(grams, len)
{
  
  molWeight <- 650*len
  moles <- grams/molWeight
  
  return(moles)

}

dnaMoles <- convertGramsToMoles(grams = 6E-11, len = 1000)
dnaMoles
## [1] 9.230769e-17

In the definition above:

  • The inputs to the function are specified in parentheses after function. The inputs are called the arguments of the function and allow the function to be flexible and able to complete a range of tasks instead of a rigid, fixed task.
  • The term to the left of the assignment operator is the name of the function, which is used to call it.
  • The function takes the values of the arguments as input and performs all the steps written within the curly braces { and }.
  • The output is specified by the return() statement, which is the last statement of the function. The function is said to return the output.

The function is invoked or called with its name followed by the arguments enclosed in parentheses. The output returned by the function can be stored in a variable with the assignment operator.
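As a minimal sketch of the calling conventions described above, the chunk below repeats the definition of convertGramsToMoles() so that it runs on its own, and then calls it once with named arguments and once with positional arguments. The variable names molesNamed and molesPositional are illustrative only.

```r
# Definition repeated from above so that this chunk is self-contained
convertGramsToMoles <- function(grams, len)
{
  molWeight <- 650 * len        # same weight calculation as in the code above
  moles <- grams / molWeight
  return(moles)
}

# Named arguments may be supplied in any order
molesNamed <- convertGramsToMoles(len = 1000, grams = 6E-11)

# Positional arguments are matched in the order used in the definition
molesPositional <- convertGramsToMoles(6E-11, 1000)

# Both calls perform the same calculation
identical(molesNamed, molesPositional)
```

Naming the arguments makes a call easier to read and protects against supplying the values in the wrong order.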

Functions allow you to reuse your code and offer a number of benefits:

  • As requirements change, you only need to update code in one place, instead of many.
  • You eliminate the chance of making incidental mistakes when you copy and paste (e.g., updating a variable name in one place, but not in another).
  • Functions make it easier to reuse work from project to project, increasing your productivity over time.
  • Functions allow people to share code with each other. In fact, the libraries that we load such as ggplot2 and dplyr are just collections or libraries of functions.
  • You can give a function an evocative name that makes your code easier to understand.

  1. Inspect the Environment tab and note that the function has been added to the environment. This means that if we would like to repeat the calculation with a different mass or length, instead of copying the code, we can just call the function again with different inputs. Note that once we have entered the first few letters, RStudio automatically offers to autofill the name, which can be accomplished with the tab key. Compute the number of human genomes in 60 pg of DNA by calling the convertGramsToMoles() function with appropriate values of the arguments.

  2. One may specify default values of the arguments in the function definition using the = operator. This has the benefit of simplifying function calls when some arguments are rarely changed. Rewrite the convertGramsToMoles() function to have the len argument default to the length of the human genome and then call it with just one argument.

  3. It is possible to call functions from within functions. This allows us to divide and conquer: to break down a complex task into several simpler ones and then combine them using function calls. This also makes code more reusable. Write a function to convert grams of DNA to the number of molecules that reuses the convertGramsToMoles() function.

  4. The arguments and outputs can be any data structure or data type. Write a function that takes the flights tibble and an airline code as arguments and returns the longest departure delay suffered by the airline in the entire year.

  5. Modify the function you wrote above to return both the origin airport code and the departure delay. Use a list (?list).

  6. You can even return plots (plot objects). Write a function that takes an airport code and returns a plot object plotting the average number of flights departing in each hour during the day. Use require() to silently load the nycflights13 library inside the function instead of passing the flights tibble as an argument.
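The ideas used in the activities above (default argument values, calling one function from another, and returning several values in a list) can be sketched on an unrelated toy problem. The functions circleArea() and cylinderSummary() below are hypothetical and serve only as an illustration; they are not part of the activities.

```r
# Hypothetical helper: area of a circle; radius defaults to 1
circleArea <- function(radius = 1)
{
  return(pi * radius^2)
}

# Reuses circleArea() and returns two values bundled in a list
cylinderSummary <- function(radius = 1, height = 1)
{
  baseArea <- circleArea(radius)     # a function called from within a function
  volume <- baseArea * height
  return(list(baseArea = baseArea, volume = volume))
}

circleArea()                         # no argument given, so the default radius is used
res <- cylinderSummary(radius = 2, height = 3)
res$volume                           # elements of the list are accessed by name
```

A list is convenient here because its elements can have different types and lengths, for example a character code and a number.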


2.2 Flow control

The greater the range of input data and tasks that our code works with, the more useful it is. One of the ways of accomplishing this is to write code that changes its behavior based on the context, that is, the values of variables. Changing which lines of code are run or how many times a task is repeated is known as flow control. There are two main ways of controlling the flow of code. The first is to make choices or decisions while the second is to perform repetitive tasks.

2.2.1 Making choices with if-else and switch

One can make choices by using the if-else conditional statements.

if (condition) true_action

condition is a Boolean variable or expression. If it has the value TRUE, then the true_action statements are run. If it is FALSE, then the true_action statements are skipped and execution proceeds to the statement after the if. Another way of writing it is

if (condition) true_action else false_action

in which false_action statements are run if condition is FALSE.
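The two forms can be sketched with a toy example. The variable names temperature and feel, and the thresholds, are made up for illustration.

```r
temperature <- 12      # hypothetical value, in degrees Celsius

# if alone: the action is skipped because the condition is FALSE
if (temperature < 0) print("Freezing")

# if-else: exactly one of the two actions is run
if (temperature < 15) feel <- "cold" else feel <- "warm"

feel
```

When the statement is typed at the top level of a script or the console, else has to appear on the same line as the end of true_action; otherwise R treats the if statement as complete before it reaches the else.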

  1. Generate two random numbers using the rnorm(n, mean, sd) function and assign them to two variables. Test whether the first one is greater than the second one and assign the result to a variable. Print out this variable. Use if to print out “Greater” if the first number is greater. After the if write a statement to print “Hello”. Run the code repeatedly.

  2. Change the code above to print out “Less than or equal to” if the first number is less than or equal to the second one. Run the code repeatedly.

  3. It is possible to execute multiple statements if the condition is TRUE or FALSE by enclosing them in { and }. Multiple statements enclosed within {} are known as compound statements. In addition to printing out “Greater” or “Less than or equal to”, print out the absolute value of the difference between the two numbers.

  4. if returns a value that can be assigned to a variable. Also, one may directly write a statement that returns a Boolean value in the condition. Generate a random number using rnorm() and write a statement to determine its absolute value.

  5. Recursion. Now that we know how to write conditional statements, we can show the cool feature of recursion. It is possible for a function to call itself. This is known as recursion and makes it possible to write very elegant code. Write a function that takes an integer as an argument and returns the factorial \(n! = n \cdot (n-1) \cdot (n-2) \cdots 3 \cdot 2 \cdot 1\).

  6. It is also possible to test multiple conditions using else if. Write a function that takes a student’s grade in percentage and returns the letter grade.

  7. When one of many alternatives needs to be picked based on the value of a character string, the switch statement may be used for a more concise syntax.

switch(<variable>,
        <first possible value> = <first return value>,
        <second possible value> = <second return value>,
        <third possible value> = <third return value>,
        .
        .
        .
        stop(<error message string>)
      )

Write a function that takes the type of statistical summary as a character string (mean, standard deviation, median, or MAD (median absolute deviation)) and the name of a numeric variable in the penguins dataset. The function should compute the desired statistical summary for the given variable for each combination of species and island and return the tibble. Note: You cannot use the column name argument directly in dplyr verbs since it is a string and not the actual column variable. The get() function returns the actual column variable given the name of the column.
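As a minimal sketch of the switch template above, the hypothetical function unitToMetres() below picks a conversion factor based on a character string. The function name and the units are assumptions made for illustration and are unrelated to the penguins task.

```r
# Hypothetical example: conversion factor from a unit of length to metres
unitToMetres <- function(unit)
{
  convFactor <- switch(unit,
                       km = 1000,
                       m  = 1,
                       cm = 0.01,
                       stop("Unknown unit: ", unit)   # unnamed last argument is the fallback
                      )
  return(convFactor)
}

unitToMetres("km")
5 * unitToMetres("cm")
```

Because the last argument is unnamed, it is evaluated only when none of the named alternatives matches, so an unexpected string raises an error instead of silently returning nothing.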

2.2.2 Performing repetitive tasks by looping

There are many situations when some lines of code have to be run repeatedly. For example, we may have a vector of numbers and need to compute the absolute value of each number. We could just repeat the code as many times as the length of the vector,

rVec <- rnorm(n = 3)

absR <- vector(mode = "numeric", length = 3)

absR[1] <- if (rVec[1] >= 0) rVec[1] else -1*rVec[1]
absR[2] <- if (rVec[2] >= 0) rVec[2] else -1*rVec[2]
absR[3] <- if (rVec[3] >= 0) rVec[3] else -1*rVec[3]

rVec
## [1] -0.1719711 -0.1363529  0.6289787
absR
## [1] 0.1719711 0.1363529 0.6289787

but this has several weaknesses. First, the code is not general: if the length of the vector changes, then we have to change the code. Second, the code is impractical since the vector may have millions of elements. Third, copying, pasting, and modifying indices is inefficient as we have to write a lot of code. Fourth, copying, pasting, and modifying indices is error prone. Fortunately, R (like most other programming languages) provides a way to execute one or more lines of code repeatedly until some condition is met. This is known as looping since the flow chart of the code has a loop in it.

By Giacomo Alessandroni - File:For-loop-diagram.svg, CC BY-SA 4.0

There are two methods of looping. The first is the for statement, which allows us to loop code a fixed number of times, once for each element of a vector. The second is the while statement, which allows us to loop code a variable number of times while some condition is true.
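As a minimal sketch of the for statement, the chunk below squares each element of a small vector. The vector values and the names values and squares are made up for illustration; seq_along() generates the indices 1, 2, ..., length(values).

```r
values <- c(2, 5, 9)                                  # hypothetical input vector
squares <- vector(mode = "numeric", length = length(values))

# i takes the values 1, 2, 3 in turn, one per element of values
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
}

squares
```

The loop body does not mention the length of the vector, so the same code works unchanged if values has three elements or three million.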

for (<variable> in <vector>) <expression>

<expression> could be a single or compound statement.

  1. Rewrite the code for computing the absolute values of the elements of a numeric vector using a for loop, assigning the results to a new vector.

  2. Compute the sum of the absolute values of the elements of the vector by accumulating the values. This is a common technique, where a variable is initialized and its value is updated with each iteration, accumulating some quantity (sum, product, or some other operation).

  3. Write a function that takes a penguin species name and island as arguments and returns how many penguins of the specified species live on the specified island.

  4. Nested loops. Just as we have put if inside a for loop above, it is possible to put a for loop inside another for loop. This is known as nesting. In fact, like Russian dolls, it is possible to repeatedly nest loops inside loops. Write a function to count the number of penguins in each combination of species and island. Return a data frame with three columns: species, island, and n. Reuse the function defined above. Accomplish the same task using dplyr and compare the results. Hint 1: the unique() function takes a vector and returns a vector containing only the unique values, similar to the distinct() verb of dplyr, except the latter returns tibbles with unique combinations. Hint 2: Remember to initialize the data frame using data.frame() and vector().
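The accumulation and nesting techniques mentioned in the activities above can be sketched on toy data. The names prodSoFar and pairs and the input values are hypothetical.

```r
# Accumulation: running product of the elements of a vector
values <- c(2, 3, 4)
prodSoFar <- 1                       # initialise before the loop
for (v in values) {
  prodSoFar <- prodSoFar * v         # update on every iteration
}
prodSoFar

# Nesting: the inner loop runs in full for each step of the outer loop
pairs <- character(0)
for (letter in c("a", "b")) {
  for (number in 1:3) {
    pairs <- c(pairs, paste0(letter, number))
  }
}
pairs
```

With two values in the outer loop and three in the inner loop, the innermost statement is run six times, once for every combination.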

Situations often arise in which it is not known in advance how many times the loop should be performed, and the number of iterations varies from situation to situation. For example, we could be simulating population growth with births and deaths and would like to stop the simulation when the population reaches equilibrium, without knowing beforehand how many generations that is going to take. For such situations, R (like other programming languages) provides the while flow control loop.

while (condition) expr

expr, which could be a compound statement, is executed while the Boolean variable or expression condition is TRUE.
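As a minimal sketch of a while loop, the chunk below counts how many times a number has to be halved before it drops below 1. The starting value and the names x and nHalvings are made up for illustration.

```r
x <- 100            # hypothetical starting value
nHalvings <- 0

# The number of iterations is not fixed in advance; it depends on x
while (x >= 1) {
  x <- x / 2
  nHalvings <- nHalvings + 1
}

nHalvings
x
```

The condition is checked before every iteration, so the body must change something that the condition depends on; otherwise the loop never ends.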

  1. Write a function that takes a starting population, n_0, a growth rate, alpha, a death rate, beta, and a tolerance, tol, and computes the population in each successive generation until the population reaches equilibrium. Use the logistic model in which the change in population each generation is proportional to the population, \(n_{g+1} - n_g = r(n_g) \cdot n_g\), where \(g\) is the generation number, \(n_g\) is the population in generation \(g\), and \(r(n_g)\) is the rate of growth. As the population grows, the rate of growth reduces due to competition for limited resources so \(r(n_g) = \alpha - \beta \cdot n_g\). Equilibrium is reached when the absolute change in population between successive generations is less than the tolerance. Use the while flow control statement to determine when to stop. The function should return a data frame with two numeric columns: generation and population. Plot the growth curve for \(n_0 = 10\), \(\alpha = 1\), \(\beta = 0.001\), and tolerance of 1. Change \(\alpha\), \(\beta\), or the tolerance to see how the number of iterations of the loop changes.
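As a check on the simulation, the equilibrium population follows directly from the model above. Writing \(n^*\) for the equilibrium population (a symbol introduced here for illustration), the population stops changing when the rate of growth is zero:

\[ n_{g+1} - n_g = (\alpha - \beta \cdot n^*) \cdot n^* = 0 \quad \Rightarrow \quad n^* = \frac{\alpha}{\beta} \quad \text{for } n^* \neq 0 \]

For \(\alpha = 1\) and \(\beta = 0.001\) this gives \(n^* = 1000\), so the growth curve should level off close to 1000.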

2.3 Additional exercises

2.3.1 Functions

  1. Write a function to take the number of grams of DNA and the length of the genome as inputs to determine how many cells need to be lysed to yield that amount. Make sure to reuse the function defined above. Set the default value of the genome size to the length of the human genome. Note that the number of cells is an integer. Look up the ceiling() and floor() functions to round the number. Use the function to compute how many cells are needed for 60 ng of human DNA.

  2. Write a function to take the species of penguin as input and return the body mass of the smallest penguin of that species. Look up the function to compute the minimum of a vector (?min). Use pipes.

  3. Write a function to take a numeric vector as an input, scale its values to lie between 0 and 1, and return the scaled values. This can be accomplished by subtracting the minimum from each element and dividing this difference by the range (the difference between the maximum and minimum). The functions min(), max(), or range() might be useful here. Recall that operations between vectors are carried out elementwise. The rnorm(n, mean = 0, sd = 1) function generates a double precision vector of length n containing normally distributed random numbers with mean mean and standard deviation sd. Use rnorm() to generate a vector of length 10, assign it to a variable, and then test your function.

  4. Write a function to take a data frame and a column name (string) as inputs and add a new column called scaledCol of the scaled values to it. Make sure to reuse the scaling function you wrote previously. Return the data frame. Check whether your function works by scaling columns of any of the datasets you have worked with. Note: You cannot use the column name argument directly in dplyr verbs since it is a string and not the actual column variable. The get() function returns the actual column variable given the name of the column.

  5. Write a function that takes an origin airport as an argument, creates a plot of the average departure delay and the number of flights at that origin for each airline, and returns the plot object. Your function should ignore/remove airlines with fewer than 1 flight per day. Use require() to load the nycflights13 library inside the function. Note: By default geom_bar() plots the counts of each category to make histograms. You can instruct geom_bar() to plot the values in the column using the stat = "identity" argument.
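The rounding and elementwise operations mentioned in the exercises above behave as sketched below. The values of x and v are made up for illustration.

```r
x <- 2.3
roundedUp <- ceiling(x)       # smallest integer not less than x
roundedDown <- floor(x)       # largest integer not greater than x
c(roundedUp, roundedDown)

# An operation between a vector and a single number is carried out elementwise
v <- c(4, 8, 10)
shifted <- v - min(v)         # min(v) is subtracted from every element
shifted
range(v)                      # vector holding the minimum and the maximum
```

Note that range() returns the two extreme values rather than their difference, so the difference has to be computed separately, for example with diff() or max() - min().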


2.3.2 Flow control

  1. Write a function to take two numbers and print out “divisible” if the first number is divisible by the second or “not divisible” if not. Hint: the modulo operator (x %% y) returns the remainder when one number (x) is divided by another (y). Read more on the help page for arithmetic operators (?Arithmetic). Test your function.

  2. Write a function to plot the distribution of body mass of the penguins in the penguins dataset. The function should take a character string as an argument. If the value is “histogram”, plot the distribution as a histogram. If the value is “density”, then plot the distribution as a density. Use require() to silently load palmerpenguins and ggplot2 if necessary. Return the plot object.

  3. The Heaviside function, \(H(x)\) is \(1\) when \(x \geq 0\) and \(0\) when \(x<0\). Write a function that takes a number as input and returns the value of the Heaviside function.

  4. Write a function that takes an integer \(n\) as an argument and computes the sum of all the integers from \(1\) to \(n\), that is \(1 + 2 + \cdots + (n-1) + n\). Use recursion.

  5. Write a function that takes a codon (a character string of length 3) as input and then uses the genetic code to return the IUPAC letter code of the amino acid encoded by the codon. Which flow control statement would work best here? Write the function for 4 different codons.
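The modulo operator mentioned in the first exercise above behaves as sketched below. The numbers and the variable names are made up for illustration.

```r
remA <- 7 %% 3          # remainder when 7 is divided by 3
remB <- 12 %% 4         # zero, because 4 divides 12 exactly
c(remA, remB)

# Like other arithmetic operators, %% works elementwise on vectors
remVec <- c(3, 4, 5) %% 2
remVec
```

Comparing the remainder with zero gives a Boolean value that can be used directly as the condition of an if statement.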