Read Chapter 2 of R for Data Science (2e).

Carry out the data types and data structures activities and exercises.

Read Sections 1.3–1.7 of R for Data Science (2e).

Carry out the penguin dataset activities and exercises.

Refer to the data types and data structures section and perform the following tasks.

- *Numeric/double precision* variables can store real numbers (with a decimal point). The data type of a variable can be determined using the `typeof` function. Check the data type of the `numHydAtoms` variable.
- *Integer* variables can store integers only. Assign the value `-34L` to a variable `numStudents` and check its data type.
- *Logical* variables can only take two values, `TRUE` or `FALSE`. Assign the value `TRUE` to a variable `passed` and check its data type.
- *Character* variables store one or more characters. Two or more characters are known as a *string*. One important consideration in working with character data is that character strings are also used for in-built R commands, variables, and functions. For example, we may want to store the string `James` in a variable called `studentName`. In order to distinguish between strings that are R variables/commands and strings that are data (also called “literal strings”), the latter are enclosed in quotes (`"`). Assign the literal string `"James"` to the variable `studentName`. Check the variable’s data type.
  - Assign `James` to the `studentName` variable without quotes. What is the error? Why?
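As a minimal sketch of how `typeof` reports the four basic types, the snippet below uses made-up variable names and values (`avg_height`, `num_rooms`, `is_open`, `city`) that are not part of the worksheet, so the tasks above are left open.

```r
# Hypothetical variables, chosen only to illustrate the four basic types
avg_height <- 1.75     # a real number is stored as a double
num_rooms  <- 12L      # the L suffix requests an integer
is_open    <- FALSE    # logical values are TRUE or FALSE
city       <- "Paris"  # quotes mark a literal string

typeof(avg_height)  # "double"
typeof(num_rooms)   # "integer"
typeof(is_open)     # "logical"
typeof(city)        # "character"
```

Without the `L` suffix, `12` would be stored as a double, which is why the suffix matters for integer variables.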

- Very often it is necessary to deal with a large number of variables. For example, we may want to store the scores of 100 students in a class. It would be very unwieldy to have to define and manipulate 100 different variables, one for each student’s score. R provides the *vector*, a structured collection of values, or *data structure*. A vector stores many values under one name, and each one may be accessed by its position or *index* in the vector. Vectors may be created with the combine function `c()`. Create a numeric vector called `student_scores` with the exam scores of 10 students using `c()`. Check the data type.
- The `i`th value of a vector may be accessed as `<vector_name>[i]`. A range of values from the `i`th to the `j`th position may be accessed by `<vector_name>[i:j]`. Print out the 6th score. Print out the 2nd through the 5th scores.
- It is possible to have vectors of other data types as well. Use `c()` to create a vector of 10 student names (`A` through `J`) called `student_names`. Check its data type. Print out the 2nd name and the 6th through 9th names.
- A very important data structure is the `data.frame`, which is a collection of named vectors of the same length. It takes the form of a table with one column per component vector and as many rows as the length of each vector. Create a data frame called `student_results` with two columns, `names` and `scores`, containing student names and scores respectively.
- There are a number of functions available to inspect data frames. `nrow()` and `ncol()` return the number of rows and columns respectively. `summary()` provides an overview and `head()` prints out the first few rows. Use these functions on the `student_results` data frame to inspect it.
- There are many ways to access the values in a data frame. An individual element in row `i` and column `j` can be accessed by `<data_frame_name>[i,j]`. An entire column may be accessed by `<data_frame_name>$<column_name>` or `<data_frame_name>[,j]`. An entire row may be accessed by `<data_frame_name>[i,]`. Print out the 5th student’s score. Print out all the scores. Print out the name and score of the last student.
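The indexing and inspection patterns above can be sketched on a small, made-up dataset. The names `city_temps`, `city_names`, and `weather`, and all the values, are hypothetical and chosen so that the student-score tasks remain open.

```r
# Hypothetical data: temperatures (in degrees Celsius) for five cities
city_temps <- c(21.5, 18.0, 25.3, 30.1, 15.8)
city_names <- c("Oslo", "Lima", "Cairo", "Delhi", "Quito")

typeof(city_temps)   # "double"
typeof(city_names)   # "character"

city_temps[3]        # the 3rd value
city_temps[2:4]      # the 2nd through 4th values

# A data frame built from two vectors of equal length
weather <- data.frame(city = city_names, temp = city_temps)

nrow(weather)        # number of rows
ncol(weather)        # number of columns
summary(weather)     # per-column overview
head(weather)        # first few rows

weather[2, 2]              # element in row 2, column 2
weather$temp               # an entire column by name
weather[, 2]               # the same column by position
weather[nrow(weather), ]   # the last row, whatever the row count is
```

Using `nrow()` to index the last row avoids hard-coding the number of rows.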

Load the libraries.

```
library(palmerpenguins)
library(ggplot2)
```

`ggplot` calls concisely

Refer to Section 1.3 of R for Data Science (2e).

So long as the correct order of function arguments is observed, the
names of the arguments (`data`, `mapping`, etc.) may
be omitted, shortening the function calls.

Therefore, the code

```
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
```

can be shortened to

```
ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
```

**Note that changing the order of unnamed arguments will result in an error
message**.
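The difference between named and unnamed arguments can be sketched as follows. The object names `p_named` and `p_positional` are illustrative only. The sketch relies on standard R argument matching: arguments given by name may appear in any order, whereas unnamed arguments are matched by position.

```r
library(palmerpenguins)
library(ggplot2)

# Named arguments: the order does not matter
p_named <- ggplot(
  mapping = aes(x = flipper_length_mm, y = body_mass_g),
  data = penguins
) +
  geom_point()

# Unnamed arguments: data must come first, then the mapping
p_positional <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
```

Both objects describe the same scatterplot. Printing either one draws it, with a warning that rows with missing values were dropped.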

- Refer to Section 1.4 of R for Data Science (2e).

One of the first steps in data analysis is to understand the
*probability* distribution of a random variable. For example, in
the penguins dataset, the species variable can take three values:
Adelie, Chinstrap, and Gentoo. The probability distribution in this case
is a set of three numbers giving the probabilities of a penguin being
Adelie, Chinstrap, or Gentoo. We can empirically estimate the
probabilities by computing the frequency of each value of the variable.
A common way to visualize the frequency distribution of
*categorical*/*factor* variables is with a bar chart.
Since bars are geometrical objects, we have to use one of the
`geom_` functions, in this case `geom_bar()`, to
plot them.

- Use `geom_bar()` to plot the frequency distribution of the species variable as a bar chart. Note that we will only specify the mapping of the species variable to the x-axis since the frequency, which is plotted on the y-axis, is not a variable in the dataset but is computed automatically by `geom_bar()`.
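A minimal sketch of the same pattern, applied to the `island` variable rather than `species` so that the task above stays open. The object name `p_bar` is illustrative only.

```r
library(palmerpenguins)
library(ggplot2)

# Only x is mapped; geom_bar() counts the rows for each island
# and uses those counts as the bar heights
p_bar <- ggplot(penguins, aes(x = island)) +
  geom_bar()

p_bar
```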

The distribution of a continuously varying, *numerical*,
variable is visualized as a *histogram*. As one would expect, this
geometrical object is created by a dedicated function called
`geom_histogram()`. Histograms subdivide the range of values
a variable takes into bins and count the frequency of occurrence of the
variable within each bin. `geom_histogram()` takes an
argument `bins`, which specifies the number of bins the range
of values is subdivided into.

- Use `geom_histogram()` to create a histogram of the body mass of the penguins with 30 bins. What can you discern about the penguins’ body mass from this plot?
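As a hedged illustration of the `bins` argument, the sketch below plots a different numerical variable, `flipper_length_mm`, with an arbitrarily chosen 20 bins. The object name `p_hist` is illustrative only.

```r
library(palmerpenguins)
library(ggplot2)

# bins sets how many intervals the range of x is divided into.
# Rows with a missing flipper length are dropped with a warning.
p_hist <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 20)

p_hist
```

Rerunning the call with a smaller or larger `bins` value shows how the choice of bin count changes the apparent shape of the distribution.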

- Imagine that you keep reducing the bin width to increase the number of bins. As you do so, you will discern more and more detail in the histogram. In the limit of zero bin width (and given enough data), the histogram, rescaled so that the bar areas sum to one, will approach the *probability density function*. A smoothed estimate of this function can be visualized with the `geom_density()` function. Use the `geom_density()` function to visualize the probability density function of the body mass variable. How does the density compare to the histogram above?
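One way to compare the two views is to draw them on the same axes, sketched here for `flipper_length_mm` rather than body mass. The object name `p_density` and the choice of 20 bins are assumptions. `after_stat(density)` rescales the histogram so that the bar areas sum to one, putting it on the same scale as the density curve.

```r
library(palmerpenguins)
library(ggplot2)

p_density <- ggplot(penguins, aes(x = flipper_length_mm)) +
  # Histogram rescaled from counts to densities
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  # Smoothed density estimate drawn on top
  geom_density()

p_density
```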

Refer to Section 1.5 of R for Data Science (2e).

How do we visualize relationships between two or more variables? The appropriate visualization depends on the types of the variables involved.

The first case is when one variable is a continuously varying
numerical variable, such as penguin body mass, and the other is a
discrete, categorical variable such as penguin species. One way to
visualize such a relationship is to use *box plots*.
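A minimal sketch of a box plot for a numerical variable against a categorical one, using `flipper_length_mm` and `island` as a stand-in pair. The object name `p_box` is illustrative only. `geom_boxplot()` is the ggplot2 function that draws box plots.

```r
library(palmerpenguins)
library(ggplot2)

# One box per island, summarizing the flipper lengths observed there.
# Rows with a missing flipper length are dropped with a warning.
p_box <- ggplot(penguins, aes(x = island, y = flipper_length_mm)) +
  geom_boxplot()

p_box
```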