Resources

I urge you to create a cheat sheet of every function covered in lecture that you are responsible for. Dr. Riddell is very thorough in telling you which ones you need. When it comes to coding, those are the only notes you really need to take.

This is all you need for dplyr. https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

This is all you need for ggplot2. https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Data

We are going to use a dataset that is always available in R: iris. Notice that it’s not in our environment, yet we can still call it by name. Let’s take a look at the data.

# View(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

What class is iris? Let’s use class(), an R function, to figure it out!

class(iris)
## [1] "data.frame"

How can we explore our data? We learned functions in lecture and from our previous assignment.

# * WHAT ELSE CAN WE USE ON IRIS TO LEARN ABOUT IT?
dim(iris)
## [1] 150   5
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
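
A few more base R functions can answer the question in the chunk above. This is only a sketch: str(), nrow(), ncol(), tail(), and levels() are my suggestions and may not be the ones from lecture, so check your slides for the ones you are responsible for.

```r
# iris ships with R, so no library() call is needed here.

# Compact overview: one line per column with its type and first few values.
str(iris)

# Row and column counts separately (dim() gives both at once).
n_rows <- nrow(iris)
n_cols <- ncol(iris)

# Last few rows, the mirror image of head().
last_rows <- tail(iris, n = 3)
last_rows

# The categories stored in a factor column.
species_levels <- levels(iris$Species)
species_levels
```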

Right now, say I only want some of this dataframe. How do I use dplyr functions to cut down the information?

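Notice: We are using %>% here because we are using dplyr. The pipe %>% takes whatever is on its left and passes it to the function on its right as the first argument.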

# * WHAT IF I ONLY WANT INFORMATION ABOUT THE SETOSA SPECIES?
# * HOW DO I NARROW DOWN MY DATAFRAME?
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
iris %>% filter(Species=="setosa")
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa
iris_subset <- iris %>% filter(Species=="setosa")
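
To see what %>% is doing in the filter() call above, here is a small sketch comparing the piped form with the plain function call. The object names (piped, nested, setosa_sepals) are made up for illustration, and select() is an extra dplyr function from the cheat sheet that you may or may not need for this lab.

```r
library(dplyr)

# Piped form: iris is handed to filter() as its first argument.
piped <- iris %>% filter(Species == "setosa")

# The same call written without the pipe.
nested <- filter(iris, Species == "setosa")

# Both forms give the same dataframe.
identical(piped, nested)

# Pipes pay off when steps are chained one after another.
# select() keeps only the columns we name.
setosa_sepals <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Sepal.Width)
dim(setosa_sepals)
```

Reading a chain top to bottom ("take iris, then filter, then select") is usually easier than reading nested calls inside out.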

Scatterplots

In Lab 1, we made a bunch of barplots. Now, you’re making scatterplots. We did this in my Lab 1 Demo (see it on YouTube or on my Git if you haven’t already), but for review, we only need a few things.

Notice: We are using = within functions to assign values to arguments. Why do we only sometimes write x=, y=, or data=? Because when you leave the names off, R matches your values to the arguments in the order they are listed on the function’s help page (the ? pane).

# * WHAT IS THE LIBRARY WE NEED?
library(ggplot2)

# * HOW DO WE START UP OUR "CANVAS"?
ggplot(data=iris_subset, mapping=aes(x=Sepal.Length, y=Sepal.Width))

# * WHAT FUNCTION DO WE USE TO ADD IN THE POINTS?
ggplot(iris_subset, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()
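
The note above about naming arguments can be checked with a function we already used. This is a minimal sketch with head(), whose help page lists x first and n second for dataframes; the object names are just for illustration.

```r
# Named arguments: we say exactly which value goes where.
named <- head(x = iris, n = 3)

# Unnamed arguments: R fills them in by position (x first, then n).
positional <- head(iris, 3)

# Named arguments can be given in any order.
reordered <- head(n = 3, x = iris)

# All three calls return the same three rows.
identical(named, positional)
identical(named, reordered)
```

Naming arguments costs a few extra keystrokes, but it means you never have to remember the order.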

Regression

I’m not going to write any code on regression because there are only two functions that your professor goes over that you need in this portion. This should be an incentive to look over your lecture slides.

How would you define regression (layman’s terms)?

In our scatterplot, where do you suspect our line of best fit to appear?

What is the formula for a straight line? Do we always have to have a straight line?

Correlation does not imply ___________________. Example: Ice cream sales and sunblock sales

How do you interpret a regression line? For every +1 increase in X, we expect _____ in Y.

Simulation

We can run “simulations” in R. In the simulations we’re doing, we tell R that we have a distribution and ask it to randomly draw values from that distribution for us. This is sampling from a known distribution.

# * RUNIF
# * HOW DO WE FIGURE OUT WHAT PARAMETERS A FUNCTION WANTS FROM US?
?runif

# * LET'S GO AND WRITE IT OUT THEN
our_sample <- runif(n = 100, min = 0, max = 10)

# * HOW DO WE PRINT WHAT IS IN THE SAMPLE?
our_sample
##   [1] 8.4493545 5.5239909 2.9523085 6.5674096 0.9608984 5.4948395 3.5816980
##   [8] 0.3639327 6.8212700 5.6823551 4.0625035 4.9801132 6.3684585 7.3071836
##  [15] 7.0443599 4.4823017 2.1730366 6.9248569 1.1386551 7.5140138 6.8423976
##  [22] 6.3768035 2.2965672 3.3253662 3.4527954 8.2015740 5.8434728 5.0198694
##  [29] 6.0777338 7.0899173 4.4380978 8.4941973 8.8109280 0.7935202 6.8819106
##  [36] 3.9444788 7.1867625 6.0522521 9.5438155 1.8252382 5.6173550 4.6894059
##  [43] 6.4206278 2.6325206 1.1769907 0.2503756 6.3060911 1.1554244 6.2960117
##  [50] 6.3010107 9.5297789 0.5840541 8.5003031 7.8456248 3.9077007 5.1201403
##  [57] 4.1706271 9.5965957 8.4001972 3.2631390 2.6733289 9.5219917 7.8525662
##  [64] 0.9333416 9.6388794 2.1313419 5.9758647 6.3346077 2.0574940 2.2470191
##  [71] 5.9879893 9.6334345 8.6955954 8.4971739 4.2358751 0.2896705 1.1302876
##  [78] 8.2853934 7.7000148 3.2993370 0.3755635 7.2829728 8.3995866 2.1381853
##  [85] 2.6417008 0.3621427 6.9115570 8.3757325 0.7163059 0.5329262 8.1345548
##  [92] 4.6914178 4.7467884 7.2565575 5.6735936 9.9830998 8.6001227 6.7273415
##  [99] 1.1279793 6.0363110
print(our_sample)
##   [1] 8.4493545 5.5239909 2.9523085 6.5674096 0.9608984 5.4948395 3.5816980
##   [8] 0.3639327 6.8212700 5.6823551 4.0625035 4.9801132 6.3684585 7.3071836
##  [15] 7.0443599 4.4823017 2.1730366 6.9248569 1.1386551 7.5140138 6.8423976
##  [22] 6.3768035 2.2965672 3.3253662 3.4527954 8.2015740 5.8434728 5.0198694
##  [29] 6.0777338 7.0899173 4.4380978 8.4941973 8.8109280 0.7935202 6.8819106
##  [36] 3.9444788 7.1867625 6.0522521 9.5438155 1.8252382 5.6173550 4.6894059
##  [43] 6.4206278 2.6325206 1.1769907 0.2503756 6.3060911 1.1554244 6.2960117
##  [50] 6.3010107 9.5297789 0.5840541 8.5003031 7.8456248 3.9077007 5.1201403
##  [57] 4.1706271 9.5965957 8.4001972 3.2631390 2.6733289 9.5219917 7.8525662
##  [64] 0.9333416 9.6388794 2.1313419 5.9758647 6.3346077 2.0574940 2.2470191
##  [71] 5.9879893 9.6334345 8.6955954 8.4971739 4.2358751 0.2896705 1.1302876
##  [78] 8.2853934 7.7000148 3.2993370 0.3755635 7.2829728 8.3995866 2.1381853
##  [85] 2.6417008 0.3621427 6.9115570 8.3757325 0.7163059 0.5329262 8.1345548
##  [92] 4.6914178 4.7467884 7.2565575 5.6735936 9.9830998 8.6001227 6.7273415
##  [99] 1.1279793 6.0363110
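
Because runif() draws at random, your numbers will not match the ones printed above. If you want the same draws every time you run your code, one option is set.seed(). This is a sketch under my own assumptions: set.seed() is not part of the demo above, and the seed value 42 is arbitrary.

```r
# Fix the random number generator so the draws can be repeated.
set.seed(42)  # assumption: any integer works; 42 is an arbitrary choice
sample_a <- runif(n = 100, min = 0, max = 10)

# Resetting the same seed reproduces exactly the same draws.
set.seed(42)
sample_b <- runif(n = 100, min = 0, max = 10)
identical(sample_a, sample_b)

# Sanity checks: every value falls between min and max,
# and the average should land somewhere near the midpoint, 5.
range(sample_a)
mean(sample_a)
```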

We can also visualize this…

# * MAKE SURE OUR LIBRARY IS LOADED
# * YOU ONLY REALLY NEED TO DO THIS ONCE
library(ggplot2)

# * MAKE A DATAFRAME
# * YOU DON'T NEED THIS RN, BUT GGPLOT DOES
our_sim_data <- data.frame(cbind(rep(0,100), our_sample))
names(our_sim_data)
## [1] "V1"         "our_sample"
# * A BASIC PLOT OF OUR SAMPLE
ggplot(our_sim_data, aes(x=our_sample, y=V1)) + geom_point()

# * SEE THE DISTRIBUTION A BIT BETTER
ggplot(our_sim_data, aes(x=our_sample, y=V1)) +
  geom_point(col=alpha("darkred", 0.3)) +
  xlab("Sampled Values") +
  ylab("") +
  ggtitle("Our Uniform Sample on [0,10]")
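
Another way to look at the same sample is a histogram, which counts how many values fall into each bin. This is an optional sketch: geom_histogram() is on the ggplot2 cheat sheet but is not part of the demo above, and the bin width of 1 is my own choice. The sample is regenerated here so the chunk runs on its own.

```r
library(ggplot2)

# Draw a fresh sample so this chunk is self-contained.
our_sample <- runif(n = 100, min = 0, max = 10)
our_sim_data <- data.frame(our_sample = our_sample)

# One bar per unit interval: [0,1], (1,2], ..., (9,10].
sample_hist <- ggplot(our_sim_data, aes(x = our_sample)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  xlab("Sampled Values") +
  ylab("Count") +
  ggtitle("Histogram of Our Uniform Sample on [0,10]")

# Printing the object draws the plot.
sample_hist
```

With a uniform distribution, the bars should be roughly the same height, though with only 100 draws they will not be perfectly even.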

In your lab assignment, you’re going to be simulating “real” data. Therefore, you’ll have to simulate some error. Start off with:

?rnorm
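
Once you have read the help page, a call to rnorm() looks a lot like our call to runif(). This is only a sketch: the mean of 0 and sd of 1 are placeholder values I picked, not the ones from your assignment, and sim_error is a made-up name.

```r
# Draw 100 values from a normal distribution.
# mean is the center of the distribution; sd controls the spread.
sim_error <- rnorm(n = 100, mean = 0, sd = 1)

# Check that we got what we asked for.
length(sim_error)

# The sample mean and sd should be close to, but not exactly,
# the values we passed in.
mean(sim_error)
sd(sim_error)
```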

Vector Addition

Important for your assignment: when you simulate vectors of the same size, you can add them! Vectors are what you would expect them to be from your high school math courses. Let’s make a vector.

# * LET'S START UP OUR VECTORS
vector_1 <- c(1,1,1)
vector_2 <- c(2,2,0)

What do we expect our third vector to be if we add these two vectors together?

vector_1 + vector_2
## [1] 3 3 1
vector_3 <- vector_1 + vector_2

These were small vectors. Don’t be intimidated when your vectors are huge! The principle remains the same. Let’s go back to iris!

# * REMEMBER WHAT WE HAVE
names(iris_subset)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
# * CHECK THAT THEY ARE THE SAME LENGTH
length(iris_subset$Sepal.Length)
## [1] 50
length(iris_subset$Sepal.Width)
## [1] 50
# * ADD THEM UP
iris_subset$Sepal.Length + iris_subset$Sepal.Width
##  [1]  8.6  7.9  7.9  7.7  8.6  9.3  8.0  8.4  7.3  8.0  9.1  8.2  7.8  7.3
## [15]  9.8 10.1  9.3  8.6  9.5  8.9  8.8  8.8  8.2  8.4  8.2  8.0  8.4  8.7
## [29]  8.6  7.9  7.9  8.8  9.3  9.7  8.0  8.2  9.0  8.5  7.4  8.5  8.5  6.8
## [43]  7.6  8.5  8.9  7.8  8.9  7.8  9.0  8.3
# * SAVE TO A NEW VARIABLE FOR LATER USE
sepal_sum <- iris_subset$Sepal.Length + iris_subset$Sepal.Width

# * IN DPLYR
iris_subset %>% mutate(sepal_sum = Sepal.Length + Sepal.Width)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_sum
## 1           5.1         3.5          1.4         0.2  setosa       8.6
## 2           4.9         3.0          1.4         0.2  setosa       7.9
## 3           4.7         3.2          1.3         0.2  setosa       7.9
## 4           4.6         3.1          1.5         0.2  setosa       7.7
## 5           5.0         3.6          1.4         0.2  setosa       8.6
## 6           5.4         3.9          1.7         0.4  setosa       9.3
## 7           4.6         3.4          1.4         0.3  setosa       8.0
## 8           5.0         3.4          1.5         0.2  setosa       8.4
## 9           4.4         2.9          1.4         0.2  setosa       7.3
## 10          4.9         3.1          1.5         0.1  setosa       8.0
## 11          5.4         3.7          1.5         0.2  setosa       9.1
## 12          4.8         3.4          1.6         0.2  setosa       8.2
## 13          4.8         3.0          1.4         0.1  setosa       7.8
## 14          4.3         3.0          1.1         0.1  setosa       7.3
## 15          5.8         4.0          1.2         0.2  setosa       9.8
## 16          5.7         4.4          1.5         0.4  setosa      10.1
## 17          5.4         3.9          1.3         0.4  setosa       9.3
## 18          5.1         3.5          1.4         0.3  setosa       8.6
## 19          5.7         3.8          1.7         0.3  setosa       9.5
## 20          5.1         3.8          1.5         0.3  setosa       8.9
## 21          5.4         3.4          1.7         0.2  setosa       8.8
## 22          5.1         3.7          1.5         0.4  setosa       8.8
## 23          4.6         3.6          1.0         0.2  setosa       8.2
## 24          5.1         3.3          1.7         0.5  setosa       8.4
## 25          4.8         3.4          1.9         0.2  setosa       8.2
## 26          5.0         3.0          1.6         0.2  setosa       8.0
## 27          5.0         3.4          1.6         0.4  setosa       8.4
## 28          5.2         3.5          1.5         0.2  setosa       8.7
## 29          5.2         3.4          1.4         0.2  setosa       8.6
## 30          4.7         3.2          1.6         0.2  setosa       7.9
## 31          4.8         3.1          1.6         0.2  setosa       7.9
## 32          5.4         3.4          1.5         0.4  setosa       8.8
## 33          5.2         4.1          1.5         0.1  setosa       9.3
## 34          5.5         4.2          1.4         0.2  setosa       9.7
## 35          4.9         3.1          1.5         0.2  setosa       8.0
## 36          5.0         3.2          1.2         0.2  setosa       8.2
## 37          5.5         3.5          1.3         0.2  setosa       9.0
## 38          4.9         3.6          1.4         0.1  setosa       8.5
## 39          4.4         3.0          1.3         0.2  setosa       7.4
## 40          5.1         3.4          1.5         0.2  setosa       8.5
## 41          5.0         3.5          1.3         0.3  setosa       8.5
## 42          4.5         2.3          1.3         0.3  setosa       6.8
## 43          4.4         3.2          1.3         0.2  setosa       7.6
## 44          5.0         3.5          1.6         0.6  setosa       8.5
## 45          5.1         3.8          1.9         0.4  setosa       8.9
## 46          4.8         3.0          1.4         0.3  setosa       7.8
## 47          5.1         3.8          1.6         0.2  setosa       8.9
## 48          4.6         3.2          1.4         0.2  setosa       7.8
## 49          5.3         3.7          1.5         0.2  setosa       9.0
## 50          5.0         3.3          1.4         0.2  setosa       8.3
iris_subset <- iris_subset %>% mutate(sepal_sum = Sepal.Length + Sepal.Width)
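
The same element-by-element addition works on simulated vectors. Here is a minimal sketch that adds a vector of normal "error" to a vector of uniform draws. All names and parameter values here are placeholders I made up, so do not treat this as the setup for your assignment.

```r
# Two simulated vectors of the same length.
sim_values <- runif(n = 100, min = 0, max = 10)
sim_error <- rnorm(n = 100, mean = 0, sd = 1)

# Check that they are the same length before adding.
length(sim_values) == length(sim_error)

# Element i of the result is sim_values[i] + sim_error[i].
noisy_values <- sim_values + sim_error
head(noisy_values)

# Subtracting the original values back out recovers the error.
all.equal(noisy_values - sim_values, sim_error)
```

Checking the lengths first matters because R will not stop you from adding vectors of different lengths; it reuses the shorter one, which is rarely what you want.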

Recap

You should now know how to make scatterplots (even though we did this last week), use runif and rnorm to simulate random numbers, and add up vectors.

We also know what linear regression is. It’s now up to you to put these tools together and to do linear regressions in R.