We'll continue with the rappers dataset, looking at a handful of scatterplots and considering what types of models we can use to represent the patterns we see in the data. Let's load our libraries and our data. Notice that I'm using `#comments` in my code. You can `#COMMENT` in all caps like me, or `#comment` in lowercase as is standard. In the following chunks, I will continue commenting in all capitals as a stylistic choice.
The following blocks look like a lot of code, but they aren't too scary.
Let's start off by loading the libraries `dplyr`, `ggplot2`, and `broom`. Then, load in the data using `read.csv()`.
# * LIBRARIES
library(dplyr)
library(ggplot2)
library(broom)
# * LOAD IN DATA
rappers <- read.csv("../data/rappers.csv")
The following chunk is just to convert height into decimal format.
# * DON'T WORRY ABOUT THIS CHUNK :D
height_to_decimal <- function(this_height) {
  # * STRIP THE INCHES MARK, SPLIT ON THE FEET MARK, THEN CONVERT TO NUMERIC
  numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "'"), as.numeric)
  # * FEET PLUS INCHES AS A FRACTION OF A FOOT
  numeric[1] + (numeric[2] / 12)
}
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")
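As a quick sanity check, here's what the helper returns for a couple of example height strings (the heights below are made up for illustration; the chunk redefines the function so it runs on its own):

```r
# * REDEFINING THE HELPER SO THIS CHUNK RUNS ON ITS OWN
height_to_decimal <- function(this_height) {
  numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "'"), as.numeric)
  numeric[1] + (numeric[2] / 12)
}
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")

height_to_decimal("5'8\"")              # * 5 FEET 8 INCHES -> ABOUT 5.667
height_to_decimal(c("6'0\"", "5'6\""))  # * WORKS ON WHOLE COLUMNS TOO
```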
Now, we'll use `mutate()` like we learned previously to append some new columns.
# * USING MUTATE TO ADD USEFUL VALUES
rappers <- rappers %>% mutate(height_decimal=height_to_decimal(height))
rappers <- rappers %>% mutate(height_decimal=as.numeric(height_decimal))
rappers <- rappers %>% mutate(age=2019-birth_year,
active=2019-start_year)
head(rappers)
Seeing the `head()` of your data won't tell you anything about its size. Using `dim()`, we see we're working with 74 records of 11 variables.
dim(rappers)
Using `str()`, we'll be able to tell what kind of data we have. Notice that we have a mix of variable types.
str(rappers)
Using the output from `str()`, we see that we have 6 variables that are numeric (they have the labels `int` or `num`). We're going to probe the (pairwise) associations between some of these variables. One of the simplest yet most influential ways we can explore the relationship between two numeric variables is linear regression. You may have seen something like it before, i.e. $y=mx+b$. For that version of the equation, $y$ is the response variable, $x$ is the explanatory variable, $m$ is the slope, and $b$ is the y-intercept.
But before we get there, we're going to draw scatterplots. Why? Well, you don't want to draw lines through patterns that really don't look like lines. You'll see what I mean soon.
Here's an example of two variables that have a strong relationship. Some words we can use to describe the relationship are "approximately linear", "strong", and "positive". A rapper's age and their number of years active tend to have a natural linear relationship.
To make a plain scatterplot, we can do as follows.
ggplot(rappers, aes(x=age, y=active)) +
geom_point() +
ggtitle("Age v. Active Years in Rappers")
And although it may not be the most splendid thing to look at, we can color in the data points corresponding to rappers who have passed away. This may give us some insight into which data points should be considered in our pattern-finding.
ggplot(rappers, aes(x=age, y=active, color=deceased)) +
geom_point()
But taking it back to one color of dots, we can add some sort of trend line to our data by using `geom_smooth()`. Notice that by default, the function uses `method="loess"`. Without going too far into the math of it all, the loess method draws a curve that follows the pattern of your data points by staying close enough to all of them at all times. (The machinery behind the method won't be tested!) The grey band surrounding the blue line represents the standard error around the line estimate.
ggplot(rappers, aes(x=age, y=active)) +
geom_point() +
geom_smooth()
In this class, the most important smoothing method is `"lm"`, which stands for linear model! If you set your method to `"lm"`, you'll notice that the line becomes straight. I set `se=FALSE` because seeing the error band around the line can get a bit funky to look at sometimes.
Later, we'll go over another way you can put a regression line on a graph.
ggplot(rappers, aes(x=age, y=active, color=deceased)) +
geom_point() +
geom_smooth(method="lm", se=FALSE, color="black")
We can write up our linear regression model using output from a handy dandy function called `lm()`, which again stands for "linear model". The format for the `lm()` function is:
lm(data=your_dataframe, your_y_variable~your_x_variable)
I challenge you now to start thinking less in `x` and `y` and more in the realm of "independent" and "dependent".
model_1a <- lm(data=rappers, active~age)
model_1a
We are going to use `tidy()` from the library `broom` to see some more information about our linear model.
tidy(model_1a)
From just the information above, we can write the linear regression equation!
$$ \begin{align} y &= b_0 + b_1x \\ &= -15.4073 + 0.9267x \end{align} $$
And as alluded to before, we can draw our regression line manually on top of our scatterplot. While using `geom_smooth()` works, sometimes you want some more control over what shows up on your plot. To demonstrate this, I'm going to add a random dotted line going across the plot as well.
ggplot(rappers, aes(x=age, y=active, color=deceased)) +
geom_point() +
geom_abline(slope=0.9267481, intercept=-15.4072512) + # * THE REGRESSION LINE
geom_abline(slope=1, intercept=-10, lty=2) # * AND THE RANDOM LINE
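If you'd rather not type the slope and intercept by hand, you can pull them out of a fitted model with `coef()` and feed them straight to `geom_abline()`. Here's a sketch of that pattern using R's built-in `mtcars` data as a stand-in (swap in your own model and dataframe):

```r
library(ggplot2)

# * FIT A MODEL AND EXTRACT ITS COEFFICIENTS
model <- lm(data=mtcars, mpg~wt)
coefs <- coef(model)  # * A NAMED VECTOR: "(Intercept)" AND "wt"

ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point() +
  geom_abline(slope=coefs["wt"], intercept=coefs["(Intercept)"])  # * NO HARD-CODED NUMBERS
```

This way, if the data change, the line on your plot updates automatically.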
Correlation is a measure of the strength of association between two variables. It is denoted $\textbf{R}$. The coefficient of determination $R^2$ is the square of correlation. Both $R$ and $R^2$ measure the goodness of fit of your line to your data points. They are different in ways that are subtle to this course, but important to know.
Never fear: $R$ is correlation and $R^2$ is the coefficient of determination, and the notation is straightforward! Indeed, if we square the correlation $R$, then we get $R^2$!
In this document, I am writing correlation as capital $R$ just to prove a point. It is normally denoted by lowercase $r$, or $\rho$ for the population correlation.
rappers %>% summarize(correlation=cor(age, active),
correlation_sq=correlation^2)
Two more functions you can use to look into your linear model are `summary()`, which is more base R, and `glance()`, which is (for lack of a better expression) more hip. Take a look at the value of "R-squared" in the summary of model 1a.
summary(model_1a)
To measure our model's goodness of fit, we will look at $R^2$. The summary above says `Multiple R-squared: 0.8304`, as in $R^2=0.8304$. It's important to note that $R^2$ is a number between 0 and 1. A value of $R^2=1$ says that the data lie exactly on a straight line. A value of $R^2=0$ means that the data show no linear association.
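You can verify the relationship between correlation and `Multiple R-squared` for yourself. A minimal sketch using the built-in `mtcars` data:

```r
# * CORRELATION SQUARED MATCHES THE MODEL'S REPORTED R-SQUARED
r <- cor(mtcars$wt, mtcars$mpg)
model <- lm(data=mtcars, mpg~wt)

r^2                       # * SQUARE THE CORRELATION...
summary(model)$r.squared  # * ...AND IT EQUALS THE R-SQUARED FROM summary()
```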
One property of $R^2$ is that you can switch your independent and dependent variables and $R^2$ will remain the same. Check it out below. We're going to let age be the `y` variable this time.
model_1b <- lm(data=rappers, age~active)
glance(model_1b)
As we suspected, $R^2=0.8304309$.
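This symmetry holds for any pair of variables, not just ours. Here's a quick check on the built-in `mtcars` data:

```r
library(broom)

# * SWAPPING X AND Y LEAVES R-SQUARED UNCHANGED
model_a <- lm(data=mtcars, mpg~wt)
model_b <- lm(data=mtcars, wt~mpg)

glance(model_a)$r.squared
glance(model_b)$r.squared  # * SAME VALUE AS ABOVE
```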
Not all data is suitable to draw a line through. We have to be reasonable when drawing a line through a graph, especially since we are ultimately trying to predict values. We cannot predict new values from a line that doesn't follow the pattern of our existing dataset.
Remember that `facet_wrap()` can separate data based on some categorical variable in your dataset. Here, we're separating by state of origin. Look at the graphs below. Notice that putting a linear model on many of them isn't quite helpful in describing the association between the variables.
The California graph doesn't really look linear. It looks like there are two clusters instead, which a line cannot describe. Several of these graphs include only 1 or 2 data points. We shouldn't be making predictions off of just two data points in the first place, so don't run a regression on those in practice either!
ggplot(rappers, aes(x=age, y=active, color=deceased)) +
geom_smooth(method="lm", se=FALSE, color="black") +
geom_point() +
facet_wrap(~origin)
Here's another plot that is not so helpful. It plots age versus height. From the get-go, the data look like "random scatter", so we cannot assume a shape/pattern!
ggplot(rappers, aes(x=age, y=height_decimal)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
Notice how low the $R^2$ value is!
model_2 <- lm(data=rappers, height_decimal~age)
glance(model_2)
Not all rappers are making the same money. In fact, four of the rappers included in our dataset are making bank with a net worth of $750 million or more. We can check whether they're statistical outliers with a boxplot. According to the boxplot below, 9 artists' net worths are outliers!
ggplot(rappers, aes(y=net_worth)) +
geom_boxplot()
These outliers happen to be influential points. They pull the line of best fit away from the other points because linear regression uses means, and means are not robust measures of centrality! (Outliers will pull the mean away from the median. Recall skewness!)
ggplot(rappers, aes(x=height_decimal, y=net_worth)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
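To see just how much a single extreme point can drag the fit, here's a toy sketch with made-up numbers: ten points sitting exactly on a line, plus one outlier.

```r
# * TEN WELL-BEHAVED POINTS ON THE LINE Y = 2X
x <- 1:10
y <- 2 * x

# * ADD ONE EXTREME POINT FAR ABOVE THE PATTERN
x_out <- c(x, 11)
y_out <- c(y, 200)

coef(lm(y ~ x))          # * SLOPE IS EXACTLY 2
coef(lm(y_out ~ x_out))  # * THE OUTLIER PULLS THE SLOPE WELL ABOVE 2
```

One influential point is enough to change the slope dramatically, which is why it's worth investigating outliers before trusting a regression line.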
I'm practicing a little subjectivity and removing artists worth $750 million or more (this sounds rude, but I don't want to be that way; no life is worth a cash value in my heart) by using `filter()`.
less_than_750 <- rappers %>% filter(net_worth<750)
We started off with a dataset of 74 rappers. We have successfully removed four. We can check by using `dim()`.
dim(less_than_750)
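If you want to practice the same filter-then-check pattern without the rappers file, here it is on the built-in `mtcars` data (the 30-mpg cutoff is arbitrary, just for illustration):

```r
library(dplyr)

# * KEEP ONLY CARS UNDER 30 MPG, THEN CONFIRM THE NEW SIZE
under_30 <- mtcars %>% filter(mpg < 30)

dim(mtcars)   # * 32 ROWS TO START
dim(under_30) # * 28 ROWS AFTER REMOVING THE FOUR 30+ MPG CARS
```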
When we scatterplot these data, notice that our scale changes a lot on the y-axis. It looks to me as though we shouldn't be plotting height against net worth; it doesn't make much sense conceptually, and the plot agrees! However, we can go ahead and plot the points below to show that there are no visible patterns.
ggplot(less_than_750, aes(x=height_decimal, y=net_worth)) +
geom_point()
Something I didn't go over is transformations. Sometimes, you'll want to transform your data so that it looks more linear. You'd want to do this if you see a pattern in your data that you think you can straighten out by applying a function. In this case, I didn't see one, so I didn't apply a function.
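For the curious, here's a toy sketch of what a transformation can do. The data below are made up: $y$ grows exponentially with $x$, so the raw scatter is curved, but taking `log(y)` straightens it out.

```r
set.seed(1)

# * MADE-UP DATA WITH AN EXPONENTIAL PATTERN
x <- seq(1, 5, length.out=50)
y <- exp(x) * exp(rnorm(50, sd=0.1))

# * THE LOG TRANSFORM MAKES THE RELATIONSHIP NEARLY LINEAR
cor(x, y)       # * DECENT, BUT THE PATTERN IS A CURVE
cor(x, log(y))  # * VERY CLOSE TO 1
```

If the log-transformed version looks linear, you can fit `lm(log(y) ~ x)` instead and interpret the model on the log scale.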