# assign the number 5 to an object called snail_length
snail_length <- 5
{tidyverse} and {palmerpenguins}
April 5, 2023
This is a Quarto Markdown document, which we’ll learn more about next week. This week we’re working in scripts, which allow you to write your code (recipe) and run the code in the console (kitchen).
R treats everything in a script as code to run. To write a comment, put a pound sign (#) at the beginning of the line, and R will skip that line. Comments are especially useful for explaining in plain language what your code is doing at each line.
We’ll start with some basics. We’ll assign values to objects.
You’ll see the output of this in the console, not your script.
Now that you’ve assigned this value to an object, you can start to work with it. Let’s see what snail_length/2 is.
This doesn’t change the value of snail_length - check this in the console.
You can save this new variable as another object.
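The steps so far can be sketched in a script as follows. The object name snail_half is an assumption made for illustration; the workshop does not name the new object.

```r
# comments start with a pound sign; R skips them when running the script
# assign the number 5 to an object called snail_length
snail_length <- 5

# divide by 2; the result prints in the console
snail_length / 2

# snail_length itself is unchanged
snail_length

# save the result as a new object (snail_half is a hypothetical name)
snail_half <- snail_length / 2
snail_half
```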
Functions are where R gets interesting. R allows you to apply functions to do calculations, from the simple to the complex.
We can start by calculating the square root of snail_length.
We might not want all the digits in that calculation, so we could round it using the round() function.
This rounds snail_length to 4. However, we want to be a little more precise than that. Check out what round() does in the console by typing ?round.
Let’s round snail_length to 3 digits instead of the nearest whole number.
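As a rough sketch of how the digits argument changes what round() returns, the example below uses a made-up measurement, shell_width, rather than the workshop’s snail_length. Both the name and the value are assumptions chosen only for illustration.

```r
# shell_width is a hypothetical measurement, not from the workshop
shell_width <- 4.23681

# square root of the measurement
sqrt(shell_width)

# with no other arguments, round() rounds to the nearest whole number
round(shell_width)              # 4

# the digits argument sets how many decimal places to keep
round(shell_width, digits = 3)  # 4.237
```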
Now, let’s try a vector of numbers. Let’s say that we measured a bunch of different fish and recorded their weights in kilograms.
Let’s say “small” fish are any fish that are < 2 kilograms. We want to know the weights of all the “small” fish that we collected.
What if we want all the “big” fish?
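One way to do this is to build the vector with c() and then subset it using square brackets and a logical condition. The weights below are invented for illustration, and treating “big” as 2 kilograms or more is an assumption.

```r
# hypothetical fish weights in kilograms
fish_weights <- c(0.8, 1.5, 2.3, 4.1, 1.9, 3.2)

# logical vector: TRUE where a fish weighs less than 2 kg
fish_weights < 2

# keep only the "small" fish
small_fish <- fish_weights[fish_weights < 2]
small_fish   # 0.8 1.5 1.9

# keep only the "big" fish (assumed here to mean 2 kg or more)
big_fish <- fish_weights[fish_weights >= 2]
big_fish     # 2.3 4.1 3.2
```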
Packages (or libraries) have functions that aren’t already built into R.
You can install packages in one of two ways. The first (most common) way is to use the function install.packages(). This is for any package that is on CRAN, or the Comprehensive R Archive Network. Try installing the package {tidyverse} using the following command: install.packages("tidyverse").
Now you have a package installed! But you now need to “load it in” to your environment. Installing a package is like buying a pan - you only need to do it once if you want to cook. However, you still need to put the pan on the stove in order to start cooking.
You can load in any package using the function library(). Try loading in the package below.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.2.0
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Apart from a startup message like the one above, nothing shows up once you’ve loaded in the package, but now you’re ready to use the functions in it!
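Put together, the install-once, load-every-session pattern might look like this in a script. Commenting out the install line is a common convention rather than something the workshop prescribes.

```r
# install once (uncomment and run in the console the first time)
# install.packages("tidyverse")

# load at the start of every R session
library(tidyverse)

# check that a core tidyverse package, dplyr, is now attached
"package:dplyr" %in% search()
```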
Later in the quarter, we’ll work with data sets from real examples (i.e. from research). To get acquainted with how to work with data in R, we’ll use some of the built-in examples. Go to the documentation to see the list of data sets that are pre-installed with R. The topics are all over the place, but they are useful for testing things out if the data you have to work with is big and unwieldy.
One of the packages that has a cool dataset to test things out with is called {palmerpenguins}. Install it in your console and load it in to your environment.
What is {palmerpenguins}? Read about it here.
The first step to using data is looking at it! Use View(penguins) to see what it is.
(Hint: did that not work? Remember to load in the package before you start using it.)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
penguins is a data frame. Data frames have rows and columns, and their cells contain data. In this case, this data frame has 8 columns and 344 rows, which you can see in the visual display.
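You can also check the size of a data frame in code rather than by eye. A minimal sketch, assuming {palmerpenguins} is already installed:

```r
library(palmerpenguins)

nrow(penguins)  # number of rows: 344
ncol(penguins)  # number of columns: 8
dim(penguins)   # rows, then columns

# print just the first six rows
head(penguins)
```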
Figure out what the columns are by using colnames(penguins).
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
Write out 1) the column name, 2) what type of variable it is, and 3) what data it contains. For example:
- species: categorical, penguin species
- island: categorical, islands where penguins were sampled
- bill_length_mm: continuous, bill length in mm
- bill_depth_mm: continuous, bill depth in mm
- flipper_length_mm: continuous, flipper length in mm
- body_mass_g: continuous, body mass in grams
- sex: categorical, male or female
- year: categorical (ordinal), year sampled
You can learn about the structure of a data frame by running the function str(). What is the output for that?
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Let’s figure out some basic information about the data set. What’s the longest bill length they measured on a penguin? Save that as an object called long_bill.
We did that visually, but you can do that in code. The function max() allows you to get the maximum value in a vector, which is a sequence of values of the same type (here, numbers). (Note: how would you double check how the function works if you hadn’t used it before?)
Huh. That was weird. We knew the longest bill was 59.6, but why does this say NA?
That’s a lot better!
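The NA appears because the column contains missing values, and max() returns NA whenever any value in the vector is missing. A sketch of the likely fix, assuming {palmerpenguins} is installed, uses the na.rm argument to drop missing values first:

```r
library(palmerpenguins)

# returns NA because bill_length_mm contains missing values
max(penguins$bill_length_mm)

# na.rm = TRUE removes the NAs before taking the maximum
long_bill <- max(penguins$bill_length_mm, na.rm = TRUE)
long_bill   # 59.6
```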
Try finding the minimum bill length, and saving that as an object called short_bill.
Let’s say you think the three different penguin species have different body masses, on average. This is where the {tidyverse} package we were using before comes in handy.
We know that there’s a column in the data frame that has species, and another column that has body masses. So if there’s a way we can get all the rows belonging to a species, then take all the numbers for body mass and average them, we can figure out the average body mass for a penguin species in the sample.
There are tidyverse functions that can help with that:
- group_by(): identifies natural groups in the data frame (categorical variables)
- summarize(): summarizes the data based on what you want
- %>%: a very!!! useful operator (not a function). This is called a “pipe” and it allows you to string functions together. You’re basically telling R, “… and then”. An example below:
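A pipeline along these lines would produce the table that follows. The column name mean_body_mass comes from that output. The object name mass_by_species and the use of na.rm = TRUE to skip missing body masses are assumptions about the original code.

```r
library(dplyr)
library(palmerpenguins)

# take penguins, and then group by species,
# and then average body mass within each group
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarize(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

mass_by_species
```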
# A tibble: 3 × 2
species mean_body_mass
<fct> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
Try figuring out what the maximum flipper length is by island.
# A tibble: 3 × 2
island max_flipper_length
<fct> <int>
1 Biscoe 231
2 Dream 212
3 Torgersen 210
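One way to arrive at the table above is to swap the grouping column and the summary function in the same pattern. The object name flipper_by_island is hypothetical, and na.rm = TRUE is assumed because the flipper column contains missing values.

```r
library(dplyr)
library(palmerpenguins)

# group by island, then take the longest flipper within each island
flipper_by_island <- penguins %>%
  group_by(island) %>%
  summarize(max_flipper_length = max(flipper_length_mm, na.rm = TRUE))

flipper_by_island
```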
You can also group by multiple columns.
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: island [3]
island species max_flipper_length
<fct> <fct> <int>
1 Biscoe Adelie 203
2 Biscoe Gentoo 231
3 Dream Adelie 208
4 Dream Chinstrap 212
5 Torgersen Adelie 210
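A sketch of code that would give the output above passes both columns to group_by(). The object name flipper_by_island_species is hypothetical, and na.rm = TRUE is again an assumption.

```r
library(dplyr)
library(palmerpenguins)

# one group for every island-species combination that occurs in the data
flipper_by_island_species <- penguins %>%
  group_by(island, species) %>%
  summarize(max_flipper_length = max(flipper_length_mm, na.rm = TRUE))

flipper_by_island_species
```

The message about grouped output is informational: after summarize(), the result is still grouped by island.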
What if you only want Biscoe island?
- filter(): keeps only the rows of a data frame that meet a condition on a column
# A tibble: 2 × 2
species mean_body_mass
<fct> <dbl>
1 Adelie 3710.
2 Gentoo 5076.
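The table above could come from a pipeline like the following, which filters first and then summarizes. The object name biscoe_mass is hypothetical, and the exact order of steps in the original code is an assumption.

```r
library(dplyr)
library(palmerpenguins)

# keep only rows from Biscoe island, then average body mass by species
biscoe_mass <- penguins %>%
  filter(island == "Biscoe") %>%
  group_by(species) %>%
  summarize(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

biscoe_mass
```

Note the double equals sign: == tests whether two values are equal, while a single = assigns an argument.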
@online{bui2023,
author = {Bui, An},
title = {Coding Workshop: {Week} 1},
date = {2023-04-05},
url = {https://an-bui.github.io/ES-193DS-W23/workshop/workshop-01_2023-04-05.html},
langid = {en}
}