0. Introduction to Scripts

This is a Quarto Markdown document, which we’ll learn more about next week. This week we’re working in scripts, which allow you to write your code (recipe) and run the code in the console (kitchen).

R treats everything in a script as code to run. To write a comment, put a pound sign (#) at the beginning of the line, and R will ignore the rest of that line. Comments are especially useful for explaining in plain language what each line of your code is doing.
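As a small sketch of how comments behave (the arithmetic is just a placeholder), R skips everything after a pound sign, whether the comment sits on its own line or follows code on the same line:

```r
# this whole line is a comment, so R skips it
2 + 3  # a comment can also follow code on the same line
```

Running both lines prints only the result of `2 + 3`.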

1. Assigning values to objects

We’ll start with some basics: assigning values to objects using the assignment operator <-.

Code

# assign the number 5 to an object called snail_length
snail_length <- 5

Code

# print snail_length
snail_length

[1] 5

You’ll see the output of this in the console, not your script.

Now that you’ve assigned this value to an object, you can start to work with it. Let’s see what snail_length/2 is.

Code

snail_length/2

[1] 2.5

This doesn’t change the value of snail_length - check this in the console.

Code

snail_length

[1] 5

You can save this new value as another object.

Code

half <- snail_length/2
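One thing worth noticing is that `half` stores the result of the calculation, not the formula. A minimal sketch, reusing the same object names (the new value of 10 is made up for illustration):

```r
snail_length <- 5
half <- snail_length / 2

# change snail_length after half has been created
snail_length <- 10

# half still holds the value from when it was calculated
half

# rerun the assignment to update it
half <- snail_length / 2
half
```

The first `half` prints 2.5 and the second prints 5, because objects only change when you assign to them again.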

2. Using functions

Functions are where R gets interesting. Functions let you carry out calculations, from simple arithmetic to complex analyses.

We can start by calculating the square root of snail_length.

Code

sqrt(snail_length)

[1] 2.236068

We might not want all the digits in that calculation, so we could round it using the round() function.

Code

round(sqrt(snail_length))

[1] 2

This rounds the square root of snail_length to 2, the nearest whole number. However, we want to be a little more precise than that. Check out what round() does in the console by typing ?round.

Let’s round the square root of snail_length to 3 decimal places instead of the nearest whole number.

Code

round(sqrt(snail_length), digits = 3)

[1] 2.236
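A short sketch of how the `digits` argument changes the result, using the same `snail_length` value of 5 (redefined here so the block runs on its own):

```r
snail_length <- 5

# default: digits = 0, so round to the nearest whole number
round(sqrt(snail_length))

# keep one decimal place
round(sqrt(snail_length), digits = 1)

# arguments can also be matched by position: the second argument is digits
round(sqrt(snail_length), 3)
```

Naming the argument (`digits = 3`) is longer to type but makes the code easier to read later.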

3. Basic sorting and filtering

Now, let’s try a vector of numbers. Let’s say that we measured a bunch of different fish and recorded their weights in kilograms.

Code

fish_weights <- c(1, 2, 3, 1, 2)

Let’s say “small” fish are any fish that are < 2 kilograms. We want to know the weights of all the “small” fish that we collected.

Code

fish_weights[fish_weights < 2]

[1] 1 1

What if we want all the “big” fish, meaning any fish that are > 2 kilograms?

Code

fish_weights[fish_weights > 2]

[1] 3
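To see why the square brackets work as a filter, it can help to run the comparison on its own. A minimal sketch using the same `fish_weights` vector (redefined so the block is self-contained). `sort()` is a base R function that the text above does not use; it is added here to illustrate the sorting half of this section:

```r
fish_weights <- c(1, 2, 3, 1, 2)

# the comparison returns one TRUE/FALSE per fish
fish_weights < 2

# square brackets keep only the positions that are TRUE
fish_weights[fish_weights < 2]

# >= also includes fish that weigh exactly 2 kilograms
fish_weights[fish_weights >= 2]

# sort from smallest to largest, or from largest to smallest
sort(fish_weights)
sort(fish_weights, decreasing = TRUE)
```

Note that fish weighing exactly 2 kilograms are neither “small” nor “big” under the definitions above, which is why `>=` gives a different answer than `>`.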

4. Packages

Packages (or libraries) have functions that aren’t already built into R.

You can install packages in one of two ways. The first (most common) way is to use the function install.packages(). This works for any package that is on CRAN, the Comprehensive R Archive Network. Try installing the package {tidyverse} by running the following command in your console: install.packages("tidyverse").

Now you have a package installed! But you now need to “load it in” to your environment. Installing a package is like buying a pan - you only need to do it once if you want to cook. However, you still need to put the pan on the stove in order to start cooking.

You can load in any package using the function library(). Try loading in the package below.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.0
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R prints a startup message listing the packages that were attached and any function conflicts. That message is expected, and you’re now ready to use the functions in the package!
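One way to see the difference between installing and loading is to check each step separately. A minimal sketch, assuming you only want to check rather than install anything; `requireNamespace()` and `search()` are base R functions, and the built-in {stats} package is used so the check works on any R installation (swap in "tidyverse" to check your own install):

```r
# TRUE if the package is installed on this computer (the pan is in the cupboard)
requireNamespace("stats", quietly = TRUE)

# loading attaches the package for this session (the pan is on the stove);
# {stats} is attached automatically when R starts, so this line is only illustrative
library(stats)

# once a package is loaded, it shows up in the search path
"package:stats" %in% search()
```

Because loading only lasts for the current session, `library()` calls usually go at the top of a script, while `install.packages()` is run once in the console.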

5. Working with data in R

Later in the quarter, we’ll work with data sets from real examples (i.e. from research). To get acquainted with how to work with data in R, we’ll use some of the built-in examples. Go to the documentation to see the list of data sets that are pre-installed with R. The topics are all over the place, but they are useful for testing things out if the data you have to work with is big and unwieldy.
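As a quick sketch of what working with a built-in data set looks like, the block below uses `mtcars`, one of the data sets that ships with R (chosen here only as an example; it is not used elsewhere in this workshop):

```r
# list the data sets that come with R (uncomment to try it in the console)
# data()

# look at the first 6 rows
head(mtcars)

# number of rows, then number of columns
dim(mtcars)
```

`head()` and `dim()` work the same way on any data frame, so they are a quick first check before digging into a new data set.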

One of the packages that has a cool dataset to test things out with is called {palmerpenguins}. Install it in your console and load it in to your environment.

What is {palmerpenguins}? Read about it here.
The first step to using data is looking at it! Use View(penguins) to see what it is.
(Hint: did that not work? Remember to load in the package before you start using it.)

Code

library(palmerpenguins)
penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g

penguins is a data frame. Data frames have rows and columns, and their cells contain data. In this case, this data frame has 8 columns and 344 rows, which you can see in the visual display.

Figure out what the columns are by using colnames(penguins).

Code

colnames(penguins)

[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"

For each column, write out 1) the column name, 2) what type of variable it is, and 3) what data it contains. For example:
- species: categorical, penguin species
- island: categorical, islands where penguins were sampled
- bill_length_mm: continuous, bill length in mm
- bill_depth_mm: continuous, bill depth in mm
- flipper_length_mm: continuous, flipper length in mm
- body_mass_g: continuous, body mass in grams
- sex: categorical, male or female
- year: categorical (ordinal), year sampled

You can learn about the structure of a data frame by running the function str(). What is the output for that?

Code

str(penguins)

tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
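Each line of the `str()` output starts with `$` followed by a column name, and `$` is also how you pull a single column out of a data frame. A minimal sketch on a small made-up data frame (`toy_penguins` and its values are hypothetical, not the real data):

```r
toy_penguins <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Adelie")),
  bill_length_mm = c(39.1, 46.1, 40.3),
  year = c(2007L, 2008L, 2009L)
)

# pull out one column as a vector
toy_penguins$bill_length_mm

# check what type of data each column holds
class(toy_penguins$species)
class(toy_penguins$bill_length_mm)
class(toy_penguins$year)

# list the categories (levels) of a factor
levels(toy_penguins$species)
```

Factors are how R stores categorical variables, while numeric and integer columns hold numbers.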

Let’s figure out some basic information about the data set. What’s the longest bill length they measured on a penguin? Save that as an object called long_bill.

Code

long_bill <- 59.6

We did that by eye, but you can also do it in code. The function max() returns the largest value in a vector, which is an ordered set of values of the same type (here, numbers). (Note: how would you double check how the function works if you hadn’t used it before?)

Code

max(penguins$bill_length_mm)

[1] NA

Huh. That was weird. We knew the longest bill was 59.6, but why does this say NA?

Code

max(penguins$bill_length_mm, na.rm = TRUE)

[1] 59.6

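That’s a lot better! The column contains missing values (NA), and max() returns NA whenever any value is missing unless you set na.rm = TRUE to remove the missing values first.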
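A minimal sketch of why `na.rm = TRUE` matters, using a short made-up vector (`toy_bills` and its values are hypothetical) that contains one missing value:

```r
toy_bills <- c(39.1, NA, 59.6, 32.1)

# any NA makes the maximum unknown, so R returns NA
max(toy_bills)

# is.na() flags missing values; sum() counts the TRUEs
sum(is.na(toy_bills))

# na.rm = TRUE removes missing values before calculating
max(toy_bills, na.rm = TRUE)
```

Counting the missing values first is a useful habit, since it tells you how much data the calculation is leaving out.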

Try finding the minimum bill length, and saving that as an object called short_bill.

Code

# save the minimum bill length as short_bill
short_bill <- min(penguins$bill_length_mm, na.rm = TRUE)
# print short_bill
short_bill

[1] 32.1

6. Data exploration

Let’s say you think the three penguin species differ in body mass, on average. This is where the {tidyverse} package we loaded earlier comes in handy.

We know that there’s a column in the data frame that has species, and another column that has body masses. So if there’s a way we can get all the rows belonging to a species, then take all the numbers for body mass and average them, we can figure out the average body mass for a penguin species in the sample.

There are tidyverse functions that can help with that:
- group_by(): groups the rows of a data frame by the values in one or more columns (usually categorical variables)
- summarize(): collapses each group into a single row using the calculation you specify, such as a mean or a maximum
- %>%: a very!!! useful operator (not a function). This is called a “pipe” and it allows you to string functions together. You’re basically telling R, “… and then”. An example below:

Code

# tell R what data frame you want to use
penguins %>% 
  # and then, group the data frame by species
  group_by(species) %>% 
  # and then, summarize: create a new column called `mean_body_mass` from body_mass_g
  summarize(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

# A tibble: 3 × 2
  species   mean_body_mass
  <fct>              <dbl>
1 Adelie             3701.
2 Chinstrap          3733.
3 Gentoo             5076.
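The group-then-summarize idea is not specific to the tidyverse. As a cross-check, here is a minimal sketch on a small made-up data frame (`toy_penguins`, hypothetical values) using base R's `tapply()`. This is a different tool from the `group_by()` and `summarize()` pipeline above, but it computes the same kind of per-group mean:

```r
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, NA, 5000, 5200)
)

# split body_mass_g by species, then take the mean of each group;
# na.rm = TRUE is passed along to mean()
toy_means <- tapply(
  toy_penguins$body_mass_g,
  toy_penguins$species,
  mean,
  na.rm = TRUE
)

toy_means
```

The result is a named vector rather than a data frame, which is one reason the tidyverse version is often easier to keep working with.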

Try figuring out what the maximum flipper length is by island.

Code

# tell R that you want to use the data frame penguins
penguins %>% 
  # and then, group the data frame by island
  group_by(island) %>% 
  # and then, summarize: create a new column called 'max_flipper_length' from flipper_length_mm
  summarize(max_flipper_length = max(flipper_length_mm, na.rm = TRUE))

# A tibble: 3 × 2
  island    max_flipper_length
  <fct>                  <int>
1 Biscoe                   231
2 Dream                    212
3 Torgersen                210

You can also group by multiple columns.

Code

# use penguins
penguins %>% 
  # group by island, then species
  group_by(island, species) %>% 
  # summarize: get max flipper length
  summarize(max_flipper_length = max(flipper_length_mm, na.rm = TRUE))

`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.

# A tibble: 5 × 3
# Groups:   island [3]
  island    species   max_flipper_length
  <fct>     <fct>                  <int>
1 Biscoe    Adelie                   203
2 Biscoe    Gentoo                   231
3 Dream     Adelie                   208
4 Dream     Chinstrap                212
5 Torgersen Adelie                   210
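The message about `.groups` appears because, after summarizing by two columns, the result is still grouped by the first one. A minimal sketch, assuming {dplyr} is installed (it comes with {tidyverse}) and using a small made-up data frame (`toy_penguins`, hypothetical values), that sets `.groups = "drop"` to return an ungrouped result:

```r
library(dplyr)

toy_penguins <- data.frame(
  island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  species = c("Adelie", "Gentoo", "Adelie", "Adelie"),
  flipper_length_mm = c(190, 230, 185, NA)
)

toy_max <- toy_penguins %>%
  group_by(island, species) %>%
  # .groups = "drop" removes the grouping and silences the message
  summarize(
    max_flipper_length = max(flipper_length_mm, na.rm = TRUE),
    .groups = "drop"
  )

toy_max
```

Only combinations of island and species that actually occur in the data get a row, which is also why the output above has 5 rows rather than one for every possible pairing.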

What if you only want Biscoe Island?
- filter(): keeps only the rows of a data frame that meet a condition on one or more columns

Code

# use the penguins data frame
penguins %>% 
  # filter the data frame to only include Biscoe Island
  filter(island == "Biscoe") %>% 
  # group by species
  group_by(species) %>%
  # calculate mean body mass
  summarize(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

# A tibble: 2 × 2
  species mean_body_mass
  <fct>            <dbl>
1 Adelie           3710.
2 Gentoo           5076.
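`filter()` follows the same logic as the square-bracket filtering from section 3: a comparison produces TRUE or FALSE for each row, and only the TRUE rows are kept. A minimal base R sketch on a made-up data frame (`toy_penguins`, hypothetical values), shown for comparison rather than as a replacement for `filter()`:

```r
toy_penguins <- data.frame(
  island = c("Biscoe", "Dream", "Biscoe"),
  species = c("Adelie", "Chinstrap", "Gentoo"),
  body_mass_g = c(3700, 3750, 5000)
)

# one TRUE/FALSE per row
toy_penguins$island == "Biscoe"

# keep the TRUE rows and all columns (note the comma before the closing bracket)
toy_biscoe <- toy_penguins[toy_penguins$island == "Biscoe", ]

toy_biscoe
```

The comma inside the brackets separates rows from columns; leaving the column side empty keeps every column.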

Citation

BibTeX citation:

@online{bui2023,
  author = {Bui, An},
  title = {Coding Workshop: {Week} 1},
  date = {2023-04-05},
  url = {https://an-bui.github.io/ES-193DS-W23/workshop/workshop-01_2023-04-05.html},
  langid = {en}
}

For attribution, please cite this work as:

Bui, An. 2023. “Coding Workshop: Week 1.” April 5, 2023. https://an-bui.github.io/ES-193DS-W23/workshop/workshop-01_2023-04-05.html.