1 Study materials

1.1 Getting ready for the class

1.1.1 Starting an RStudio session on talon

All of the work in this class will be performed in RStudio on the talon high performance computing (HPC) system of UND. Your account has been created. Watch the video and follow the steps below to launch RStudio on talon.

  1. Visit the talon apps webpage at https://apps.talon.und.edu.

  2. Log in with your UND <firstname>.<lastname> login and password.


  3. Click on the talon Apps dropdown menu.


  4. Choose RStudio.


  5. Set R Version to 4.3.0, Number of hours to 2, Number of CPU cores to 1, and RAM to 8, then click the Launch button. Optionally, check the I would like to receive an email when the session starts box.


  6. The RStudio session will launch in a new window or tab of your browser.



1.1.2 Setting up an RStudio Project for the course

In the following steps you will create an RStudio Project for yourself. This project will be the main way for you to perform classwork, including study guides, in-class work, and exams.

  1. If you aren’t already in an RStudio session, launch one by following the steps in Starting an RStudio session on talon. Note that the top right shows Project (None).

  2. Under the File menu choose New Project....

  3. In Create Project in the New Project Wizard choose Version Control.

  4. In Create Project from Version Control choose Git.

  5. In the Repository URL field under Clone Git Repository enter file:///home/groupdirs/systemsbiology/coursematerials/<firstname>.<lastname>.git, replacing <firstname> with your first name and <lastname> with your last name in all lowercase letters. Be careful about the symbols and capitalization; the URL has to match exactly.

  6. The Project directory name field should populate automatically when you click inside it. If it does not, enter <firstname>.<lastname>.

  7. The Create project as subdirectory of field should show the ~ symbol. If it doesn’t, click Browse..., click the Home button under Choose directory, and click the Choose button.

  8. Click the Create Project button.

  9. This will create and open the project. Note that Project (None) has changed to <firstname>.<lastname> in the top right.
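
The repository URL in the steps above follows a fixed pattern, so it can be checked before typing it into the wizard. The snippet below is a minimal sketch, assuming a student named John Smith; `course_repo_url()` is a hypothetical helper written for this illustration and is not part of the course materials.

```r
# Hypothetical helper (not provided by the course): builds the repository
# URL from a first and last name, forcing both to lowercase.
course_repo_url <- function(first, last) {
  user <- paste0(tolower(first), ".", tolower(last))
  paste0(
    "file:///home/groupdirs/systemsbiology/coursematerials/",
    user, ".git"
  )
}

# Example with an assumed student name.
url <- course_repo_url("John", "Smith")
url
# "file:///home/groupdirs/systemsbiology/coursematerials/john.smith.git"
```

Comparing the printed string character by character with what you typed is a quick way to catch a stray capital letter or a missing dot.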



1.1.3 Closing a project

Click Close Project under the File menu.


1.1.4 Opening a project

When returning to classwork, the project needs to be opened in order to work on it.

  1. Start an RStudio session.

  2. Check whether the project is already open. If it shows the name of the project (<firstname>.<lastname>) on the top right, then you don’t need to do anything.

  3. If it shows Project (None) on the top right, then open the project by clicking Open Project... under the File menu, navigating to the <firstname>.<lastname> directory in “Home”, clicking on <firstname>.<lastname>.Rproj, and clicking Open.





1.1.5 Receiving new class material and submitting assignments

The class project is in fact a Git version control repository. A version control repository allows us to track changes in software so that we can

  • receive new changes from other people (in this case the instructor). This is called pulling changes.
  • make changes and send them to others. This is called pushing changes.
  • revert to an older version to undo mistakes.

1.1.5.1 Receiving new class material by pulling changes from the remote repository

  1. Start an RStudio session and open the project if necessary.

  2. Click on the Tools menu and choose Pull Branches under Version Control.


  3. A window will open showing the files updated in the repository. Click Close.



1.1.5.2 Submitting your work by pushing changes to the remote repository

There are four steps involved in submitting your work:

  • Create a new R notebook and save it under Submissions or save an existing notebook under Submissions.

  • Do your work in the R notebook and save it.

  • Commit the changes to your local project/repository.

  • Push the changes to the remote repository to submit your work.

1.1.5.2.1 Create a new R notebook
  1. Start an RStudio session and open the project if necessary.

  2. Let’s create a new R notebook to submit. Click File\(\rightarrow\)New File\(\rightarrow\)R Notebook. This will open a new R notebook Untitled1 in the text editor.


1.1.5.2.2 Save work to Submissions
  1. Submissions should be saved to the module’s subdirectory in the Submissions directory. Click File\(\rightarrow\)Save As... and navigate to Submissions > Module 1. Enter the name of the new notebook, <firstname>.<lastname>-NewNotebook, in the File name field, where <firstname> and <lastname> are your first and last names respectively, and click Save.



1.1.5.2.3 Do your work and save it
  1. At the top of the notebook, change the title to <lastname>'s New Notebook and click File\(\rightarrow\)Save to save this change.


1.1.5.2.4 Commit changes to local project/repository
  1. Click Tools\(\rightarrow\)Version Control\(\rightarrow\)Commit....


  2. This will open a new window. The Changes area at the top left shows a list of newly added or modified files. Check the Staged checkbox under each file to be submitted. Type a comment describing the changes under Commit message and click the Commit button. Note: A commit message is required; without one you will receive an error message. Once you have completed the assignment, put “Final submission” in the commit message.


  3. This will open a new window showing the changes committed. Click Close.


1.1.5.2.5 Push changes to remote repository to submit your work
  1. So far these changes are only in your copy of the project but are not visible to the instructor. To perform the submission, push the changes to the remote repository. Click Tools\(\rightarrow\)Version Control\(\rightarrow\)Push Branch. This will open a window showing the changes. Click Close once the process is complete.
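
For readers curious about what the menu items do, the commit and push steps above correspond to ordinary git commands. The sketch below is an illustration only and is not part of the course workflow. It assumes git is installed and on the PATH, works entirely in throwaway temporary directories (a bare repository stands in for the remote), and uses a hypothetical helper `git()` wrapped around `system2()`; the names and e-mail address are made up.

```r
# Hypothetical helper: run git with the given arguments, stop on failure.
git <- function(...) {
  args <- c(...)
  status <- system2("git", args)
  if (status != 0) stop("git failed: ", paste(args, collapse = " "))
  invisible(status)
}

# Throwaway directories so that no real repository is touched.
root   <- tempfile("git-demo-")
dir.create(root)
remote <- file.path(root, "remote.git")  # stands in for the course repository
work   <- file.path(root, "john.smith")  # stands in for your project

git("init", "--quiet", "--bare", shQuote(remote))
git("clone", "--quiet", shQuote(remote), shQuote(work))

# Save a placeholder notebook under Submissions.
module_dir <- file.path(work, "Submissions", "Module 1")
dir.create(module_dir, recursive = TRUE)
writeLines("# placeholder", file.path(module_dir, "john.smith-NewNotebook.Rmd"))

# Stage the file (the Staged checkbox) and commit it with a message.
git("-C", shQuote(work), "add", "Submissions")
git("-C", shQuote(work),
    "-c", "user.name=Demo", "-c", shQuote("user.email=demo@example.com"),
    "commit", "--quiet", "-m", shQuote("Add new notebook"))

# Push the commit so that it reaches the remote repository.
git("-C", shQuote(work), "push", "--quiet", "origin", "HEAD")

# The remote now lists the commit message.
remote_log <- system2(
  "git",
  c("-C", shQuote(remote), "log", "--all", "--format=%s"),
  stdout = TRUE
)
remote_log
```

Until the push runs, the commit exists only in the working copy, which is why committing alone does not submit anything.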



1.1.6 The workflow during class and out of class

The normal workflow involves the following steps:

  1. Start an RStudio session and open the project if necessary. During class, request a session of at most 2 hours. Longer sessions may be requested when working outside of class hours.

  2. Pull from the remote repository. Before pulling, make sure that any previous changes made have been committed; otherwise you will receive an error message.

  3. If starting work on a new study guide or homework, save a copy to the Submissions > Module# subdirectory, where # is the module number. Name the submission copy as <firstname>.<lastname>-<filename>.Rmd. For example, a student named John Smith would save the Module 1 study guide A submission as john.smith-Module1StudyGuideA.Rmd in the Submissions > Module1 subdirectory of the project.

Practice saving study guides to Submissions

  4. Do the in-class work or homework, saving the R notebook often.

  5. Whenever you reach a stopping point, even if you’re not finished, commit and push your changes.

  6. When you have completed your work and are ready to submit, enter <name of assignment> final submission in the Commit message. For example, when submitting the Module 1 Study Guide A, the commit message should read Module 1 Study Guide A final submission.

  7. Close the project if desired.
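
The naming convention for submission copies described in the workflow above can be written out as a small function. This is a minimal sketch: `submission_path()` is a hypothetical helper, not something the course provides, and it assumes the Module# directory name has no space, as in the John Smith example.

```r
# Hypothetical helper: builds the path of a submission copy relative to
# the project directory, following <firstname>.<lastname>-<filename>.Rmd.
submission_path <- function(first, last, filename, module) {
  user <- paste0(tolower(first), ".", tolower(last))
  file.path(
    "Submissions",
    paste0("Module", module),
    paste0(user, "-", filename, ".Rmd")
  )
}

# The example from the text: John Smith, Module 1 study guide A.
path <- submission_path("John", "Smith", "Module1StudyGuideA", 1)
path
# "Submissions/Module1/john.smith-Module1StudyGuideA.Rmd"
```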


1.1.7 Working within RStudio

Watch this video to learn how to use the RStudio integrated development environment (IDE). Use Help\(\rightarrow\)Cheat Sheets\(\rightarrow\)RStudio IDE Cheat Sheet to learn more. The cheat sheet is also available in the CheatSheets subdirectory of the repository.

1.1.8 Working with R Notebooks

The study guides and other assignments will be made available in two forms: an HTML webpage and an R notebook. The course content is easier to study on the HTML webpage, while the hands-on exercises in the assignments are to be performed in the R notebook versions.

Watch this video to learn how to use R notebooks.

Open the R markdown reference guide by clicking on Help\(\rightarrow\)Cheat Sheets\(\rightarrow\)R Markdown Reference Guide.


Practice these skills by carrying out the R markdown and code chunk activities.

1.2 First steps - visualizing the penguins dataset

Read Section 1.2 of R for Data Science (2e).

Carry out the penguin dataset activities.


2 Activities

2.1 Opening the study guide R notebook and saving to submissions

The R notebook for this study guide (Module 1 Study Guide A) is located in StudyGuides > Module 1 > Module1StudyGuideA.Rmd. Click on the Files tab in the Output pane, navigate to StudyGuides > Module 1, and open the study guide.


  1. Follow the instructions for saving your work in the Submissions directory and save the study guide R notebook as <firstname>.<lastname>-Module1StudyGuideA.Rmd.

  2. Follow the instructions to commit and push the new R notebook you have added to Submissions.


2.2 Writing formatted text using R markdown

Refer to the R markdown reference guide and video and perform the following tasks

  1. Write “Systems biology is scheduled for 12:30pm” in bold letters.

  2. Write “RStudio cannot be started on talon” with “not” struck out and “talon” in italics.

  3. Create an ordered list: Produce, Frozen, Meat, Dairy

  4. Create an unordered list: Eggs, Bread, Steak, Cheddar

  5. Write the following equation

  6. Create a table with two columns: Name and Score. The “Name” column should have the names of three students, Alice, Bob, and Charlie, and the “Score” column should contain their scores on a physics exam, 88, 93, and 75 respectively.

  7. Create a level 3 section heading called “Test”.


2.3 Including and executing R code chunks

  1. Create an R code chunk to print out the result of \(2^8 + 2^6 + 2^4 + 2^2 + 2^0\) and execute it.
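
As a warm-up for the exercise above, here is a minimal sketch using a different, shorter expression. It shows that R evaluates exponentiation (`^`) before addition (`+`), and that the same sum can be written with a vector of exponents. The variable name `result` is only for illustration.

```r
# ^ is evaluated before +, so no parentheses are needed.
result <- 2^3 + 2^1 + 2^0
result
# 11

# The same sum, written with a vector of exponents.
sum(2^c(3, 1, 0))
# 11
```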

2.4 Working with the penguins dataset

  1. Load the palmerpenguins library. Note that unless you restart the session, this needs to be done only once.
library(palmerpenguins)

  2. Preview the penguins dataset by running the code chunk below.
penguins
  • What form does this dataset take?

  • How many rows and columns does the dataset have? Hint: use the ⏵ and ⏴ buttons to see additional columns

  • Inspect the data; what does each column represent? Hint: you can learn more about functions and datasets by running ?<name> (?penguins).

  • Inspect the data; what might each individual row represent?

  • Formulate a hypothesis about the relationship between flipper length and body mass.
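
Besides paging through the preview, a few base R functions can help with the inspection questions above. This is a minimal sketch, assuming the palmerpenguins package is installed as in step 1; the outputs are deliberately not shown so that you can explore them yourself.

```r
library(palmerpenguins)

class(penguins)    # what kind of object the dataset is
dim(penguins)      # number of rows, then number of columns
names(penguins)    # column names
str(penguins)      # type and first few values of each column
summary(penguins)  # per-column summaries, including counts of missing values
```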


  3. Let’s plot body mass vs. flipper length step by step using a package called ggplot2.
  • First load the package.
library(ggplot2)
  • Let’s use the ggplot() function to initialize the plot. The data argument supplies the data to ggplot.
ggplot(data = penguins)
  • The plot area is empty since we haven’t specified which variables should be plotted on the x- and y-axes. We have to associate or map visual properties of the plot (aesthetics) with the data. The association/map is specified with the mapping argument to ggplot and the visual properties/aesthetics are specified by the aes() function. Multiple arguments are separated by a comma.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
  • Next, we have to tell ggplot() how to display the data. There are many different geometrical shapes data can be represented with such as bars, points, lines, and so on. One or more geometrical shapes can be associated with the data using geom_ functions. Let’s use geom_point() to tell ggplot() to display the data as points.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
  • What can you infer about the relationship between body mass and flipper length based on the plot?

  • Next, one may ask whether the relationship is the same for all species or differs by species. One way to visualize this is to plot different species in different colors. We would like to associate/map a visual property/aesthetic of the points to another variable, in this case the species of a penguin. As before, this can be accomplished using the mapping argument and the aes() function. The difference is that instead of mapping x or y coordinates, we map the color.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()
  • What can you conclude about how body mass differs between the species? Flipper length?

  • Does the relationship between flipper length and body mass differ between species?

  • To better answer the above question, we could add trend lines to the plot. Since trend lines are a geometrical object, we would use a geom_ function. In this case we will use geom_smooth(), which displays a smooth curve. How the smooth curve is computed is determined by the method argument. We will specify "lm" or linear model as the method to compute the curve so that geom_smooth() adds a straight line.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")
  • Based on the plot above, does the relationship between body mass and flipper length differ by species?

  • When we mapped species to color in ggplot(), the mapping applied to (was inherited by) all the geometrical objects, or geoms. As a result, we get three different sets of points as well as three different lines. If we wanted only one line showing the trend over all the data but still wanted to distinguish between the species in the raw data, we would move the color aesthetic to the points geom.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
  • As a final step, let’s make this plot easier to understand by giving a title, and labels for the axes and legend.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass vs. flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species"
  )
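
The inheritance of aesthetics discussed in the steps above can also be checked without looking at the picture. The sketch below is one possible way to do this, assuming ggplot2 and palmerpenguins are installed. It uses ggplot2's `layer_data()` to pull out the data computed for the `geom_smooth()` layer and counts how many groups, and therefore lines, it contains. The names `p_global`, `p_local`, and `n_lines()` are made up for this illustration.

```r
library(palmerpenguins)
library(ggplot2)

# Color mapped in ggplot(): inherited by both geoms.
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# Color mapped only in geom_point(): the smooth layer does not see it.
p_local <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

# layer_data(plot, i) returns the data computed for layer i;
# layer 2 is geom_smooth() in both plots.
n_lines <- function(plot) length(unique(layer_data(plot, 2)$group))

n_lines(p_global)  # 3: one fitted line per species
n_lines(p_local)   # 1: a single fitted line over all the data
```

Running this may print warnings about rows with missing values being removed; these are expected because two penguins lack measurements.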

2.5 Additional exercises

  1. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

  2. Make a code chunk and run the following code in it. Why does the code give an error, and how would you fix it? Fix the error and run the code.

ggplot(data = penguins) + 
  geom_point()
  3. Run this code in your head and predict what the output will look like. Then, run it in a code chunk and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
  4. Will these two graphs look different? Why/why not? Check your answer by running the code to create the graphs.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )