All of the work in this class will be performed in RStudio on the talon high performance computing (HPC) system of UND. Your account has been created. Watch the video and follow the steps below to launch RStudio on talon.
Visit the talon apps webpage at https://apps.talon.und.edu.
Log in with your UND <firstname>.<lastname> login and password.
Click the talon Apps dropdown menu.
Choose RStudio.
Set R Version to 4.3.0, Number of hours to 2, Number of CPU cores to 1, and RAM to 8, and click the Launch button. Optionally check the I would like to receive an email when the session starts box.
In the following steps you will create an RStudio Project for yourself. This project will be the main way for you to perform classwork, including study guides, in-class work, and exams.
Check that Project is None in the top right.
From the File menu, choose New Project...
Under Create Project in the New Project Wizard, choose Version Control.
Under Create Project from Version Control, choose Git.
In the Repository URL field under Clone Git Repository, enter file:///home/groupdirs/systemsbiology/coursematerials/<firstname>.<lastname>.git, replacing <firstname> with your first name and <lastname> with your last name, all in lowercase letters. Be careful about the symbols and capitalization; it has to match exactly.
The Project directory name field should populate automatically when you click inside it. If it does not, enter <firstname>.<lastname>.
The Create project as subdirectory of field should contain the ~ symbol. If it doesn’t, click Browse..., click the Home button under Choose directory, and click the Choose button.
Click the Create Project button.
Check that Project (None) has changed to <firstname>.<lastname> in the top right.
Click Close Project under the File menu.
When returning to classwork, the project needs to be opened in order to work on it.
Check whether the project is already open. If it shows the name of the project (<firstname>.<lastname>) on the top right, then you don’t need to do anything.
If it shows Project (None) on the top right, then open the project by clicking Open Project... under the File menu, navigating to the <firstname>.<lastname> directory in “Home”, clicking on <firstname>.<lastname>.Rproj, and clicking Open.
The class project is in fact a Git version control repository. A version control repository allows us to track changes in software so that we can
Start an RStudio session and open the project if necessary.
Click on the Tools menu and choose Pull Branches under Version Control.
Click Close.
There are four steps involved in submitting your work:
Create a new R notebook and save it under Submissions, or save an existing notebook under Submissions.
Do your work in the R notebook and save it.
Commit the changes to your local project/repository.
Push the changes to the remote repository to submit your work.
Start an RStudio session and open the project if necessary.
Let’s create a new R notebook to submit. Click File \(\rightarrow\) New File \(\rightarrow\) R Notebook. This will open a new R notebook, Untitled1, in the text editor.
Save the notebook in the Submissions directory. Click File \(\rightarrow\) Save As... and navigate to Submissions > Module 1. Enter the name of the new notebook, <firstname>.<lastname>-NewNotebook, in the File name field, where <firstname> and <lastname> are your first and last names respectively, and click Save.
Change the title to <lastname>'s New Notebook and click File \(\rightarrow\) Save to save this change.
Click Tools \(\rightarrow\) Version Control \(\rightarrow\) Commit...
The Changes area at the top left shows a list of newly added or modified files. Check the Staged checkbox for each file to be submitted. Type a comment describing the changes under Commit message and click the Commit button. Note: A commit message is required; without one you will receive an error message. Once you have completed the assignment, put “Final submission” in the commit message.
Click Close.
Click Tools \(\rightarrow\) Version Control \(\rightarrow\) Push Branch. This will open a window showing the changes. Click Close once the process is complete.
The normal workflow involves the following steps:
Start an RStudio session and open the project if necessary. During class, request at most a 2-hour session. Longer sessions may be requested when working outside of class hours.
Pull from the remote repository. Before pulling, make sure that any previous changes made have been committed; otherwise you will receive an error message.
If starting work on a new study guide or homework, save a copy to the Submissions > Module# subdirectory, where # is the module number. Name the submission copy <firstname>.<lastname>-<filename>.Rmd. For example, a student named John Smith would save the Module 1 Study Guide A submission as john.smith-Module1StudyGuideA.Rmd in the Submissions > Module1 subdirectory of the project.
Practice saving study guides to Submissions.
Do the in-class work or homework, saving the R notebook often.
Whenever you reach a stopping point, even if you’re not finished, commit and push your changes.
When you have completed your work and are ready to submit, enter <name of assignment> final submission in the Commit message. For example, when submitting the Module 1 Study Guide A, the commit message should read Module 1 Study Guide A final submission.
Close the project if desired.
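The naming convention described above can be sketched in R. The helper submission_path() below is hypothetical (it is not part of the course repository), and it assumes the module subdirectory is named Module followed by the module number, as in the John Smith example; adjust it if your copy of the repository names the directory differently.

```r
# Hypothetical helper (not part of the course materials): build the path of a
# submission copy following <firstname>.<lastname>-<filename>.Rmd.
submission_path <- function(first, last, filename, module) {
  user <- paste(tolower(first), tolower(last), sep = ".")
  file.path(
    "Submissions",
    paste0("Module", module),            # assumed directory name, e.g. Module1
    paste0(user, "-", filename, ".Rmd")
  )
}

path <- submission_path("John", "Smith", "Module1StudyGuideA", module = 1)
print(path)
# [1] "Submissions/Module1/john.smith-Module1StudyGuideA.Rmd"
```

The path is relative to the project directory, so it can be compared against what the Files tab shows after saving.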
Watch this video to learn how to use the RStudio integrated development environment (IDE). Use Help \(\rightarrow\) Cheat Sheets \(\rightarrow\) RStudio IDE Cheat Sheet to learn more. The cheat sheet is also available in the CheatSheets subdirectory of the repository.
The study guides and other assignments will be made available in two forms: an HTML webpage and an R notebook. The course content is easier to study on the HTML webpage, while the hands-on exercises in the assignments are to be performed in the R notebook version.
Watch this video to learn how to use R notebooks.
Open the R Markdown reference guide by clicking on Help \(\rightarrow\) Cheat Sheets \(\rightarrow\) R Markdown Reference Guide.
Practice these skills by carrying out the R Markdown and code chunk activities.
Read Section 1.2 of R for Data Science (2e).
Carry out the penguin dataset activities.
The R notebook for this study guide (Module 1 Study Guide A) is located in StudyGuides > Module 1 > Module1StudyGuideA.Rmd.
Click on the Files tab in the Output pane, navigate to StudyGuides > Module 1, and open the study guide.
Follow the instructions for saving your work in the Submissions directory and save the study guide R notebook as <firstname>.<lastname>-Module1StudyGuideA.Rmd.
Follow the instructions to commit and push the new R notebook you have added to Submissions.
Refer to the R Markdown reference guide and video and perform the following tasks:
Write “Systems biology is scheduled for 12:30pm” in bold letters.
Write “RStudio cannot be started on talon” with “not” struck out and “talon” in italics.
Create an ordered list: Produce, Frozen, Meat, Dairy.
Create an unordered list: Eggs, Bread, Steak, Cheddar.
Write the following equation
Create a table with two columns: Name and Score. The “Name” column should have the names of three students, Alice, Bob, and Charlie, and the “Score” column should contain their scores on a physics exam, 88, 93, and 75 respectively.
Create a level 3 section heading called “Test”.
Load the palmerpenguins library. Note that unless you restart the session, this needs to be done only once.
library(palmerpenguins)
View the penguins dataset by running the code chunk below.
penguins
What form does this dataset take?
How many rows and columns does the dataset have? Hint: use the ⏵ and ⏴ buttons to see additional columns
Inspect the data; what does each column represent? Hint: you can learn more about functions and datasets by running ?<name> (?penguins).
Inspect the data; what might each individual row represent?
Formulate a hypothesis about the relationship between flipper length and body mass.
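A few base R functions can help with the inspection questions above. This is only a sketch of where to look, assuming the palmerpenguins package is installed; interpreting the output is left to you.

```r
library(palmerpenguins)

class(penguins)   # what kind of object the dataset is
dim(penguins)     # number of rows, then number of columns
names(penguins)   # column names
str(penguins)     # type and first few values of each column
```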
Load ggplot2.
library(ggplot2)
Use the ggplot() function to initialize the plot. The data argument supplies the data to ggplot.
ggplot(data = penguins)
The variables to plot are supplied through the mapping argument to ggplot, and the visual properties/aesthetics are specified by the aes() function. Multiple arguments are separated by a comma.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
Next, tell ggplot() how to display the data. There are many different geometrical shapes that data can be represented with, such as bars, points, lines, and so on. One or more geometrical shapes can be associated with the data using geom_ functions. Let’s use geom_point() to tell ggplot() to display the data as points.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
What can you infer about the relationship between body mass and flipper length based on the plot?
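Because geoms are added with +, a plot can also be built up step by step. The sketch below stores the initialized plot in a variable and adds the points layer afterwards; the names base_plot and scatter are illustrative, and the result should be the same scatterplot as before.

```r
library(ggplot2)
library(palmerpenguins)

# Initialize the plot with data and aesthetic mappings, but no geoms yet.
base_plot <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

# Adding a geom_ function returns a new plot with one more layer.
scatter <- base_plot + geom_point()

length(base_plot$layers)  # number of layers before adding geom_point()
length(scatter$layers)    # number of layers after adding geom_point()

scatter  # printing the object draws the plot
```

Keeping the base plot in a variable makes it easy to try different geom_ functions on the same data and mappings.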
Next, one may ask whether the relationship is the same for all species or differs by species. One way to visualize this is to plot different species in different colors. We would like to associate/map a visual property/aesthetic of the points to another variable, in this case the species of a penguin. As before, this can be accomplished using the mapping argument and the aes() function. The difference is that instead of mapping x or y coordinates, we would map the color.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()
What can you conclude about how body mass differs between the species? Flipper length?
Does the relationship between flipper length and body mass differ between species?
To better answer the above question, we could add trend lines to the plot. Since trend lines are a geometrical object, we would use a geom_ function. In this case we will use geom_smooth(), which displays a smooth curve. How the smooth curve is computed is determined by the method argument. We will specify "lm", or linear model, as the method to compute the curve so that geom_smooth() adds a straight line.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")
Based on the plot above, does the relationship between body mass and flipper length differ by species?
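To see what method = "lm" computes, the same kind of straight-line fit can be run directly with lm() from base R. This is a sketch, assuming geom_smooth() is using its default formula y ~ x. Because color = species is mapped in ggplot(), one line is fitted per species, which the code imitates by splitting the data by species. The variable names are illustrative.

```r
library(palmerpenguins)

# Fit body mass as a linear function of flipper length within each species.
# lm() drops rows with missing values by default.
fits <- lapply(
  split(penguins, penguins$species),
  function(d) lm(body_mass_g ~ flipper_length_mm, data = d)
)

# One row per species: intercept and slope (grams per mm of flipper length).
coefs <- t(sapply(fits, coef))
print(coefs)
```

Comparing the slopes across rows is a numerical counterpart to comparing the steepness of the three lines in the plot.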
When we mapped species to color in ggplot(), this mapping applies to (is inherited by) all the geometrical objects, or geoms. As a result, we get three different sets of points as well as three different lines. If we wanted to have only one line showing the trend over all the data but still wanted to distinguish between different species in the raw data, we would move the color aesthetic to the points geom.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
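One way to check the inheritance described above is to count how many groups geom_smooth() fits in each version of the plot. The sketch below uses layer_data() from ggplot2, which returns the data computed for a given layer. The variable names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# color mapped in ggplot(): inherited by both geom_point() and geom_smooth()
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# color mapped only in geom_point(): geom_smooth() sees no grouping variable
p_local <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

# Layer 2 is geom_smooth(); each distinct group corresponds to one fitted line.
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

c(global = n_lines_global, local = n_lines_local)
```

The first count should match the number of species and the second should be a single group, which mirrors the three lines versus one line seen in the plots.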
Finally, use labs() to add a title and to label the axes and the legend.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass vs. flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species"
  )
Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
Make a code chunk and run the following code in it. Why does the following give an error and how would you fix it? Fix the error and run the code.
ggplot(data = penguins) +
  geom_point()
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )