How is it Thursday again?! I guess we can do a #TBT. Did you ever plot illustrations with Graphiti in elementary school? When's the last time you had to make a scatterplot? Was it after a science lab? Was it for fun (like me)? Was it by hand? Let's learn how to do it in R on the computer.
What did we do last time?
- ggplot2 and dplyr syntax
- dplyr summary statistics and column manipulation
- geom_histogram() and geom_bar()
This lab will get you drawing dots and lines in R. (We won't be plotting TVs like the one on the right, but we'll be plotting cool numbers indeed!)
library(dplyr)
library(ggplot2)
library(broom)
We'll load in the data and mutate it the same way we did in our previous assignments (by creating CS_rate_100).
CS_data <- read.csv("rstudio-export/Data/Cesarean.csv")
CS_data$Income_Group <- forcats::fct_relevel(CS_data$Income_Group,
                                             "Low income", "Lower middle income",
                                             "Upper middle income", "High income: nonOECD",
                                             "High income: OECD")
head(CS_data)
CS_data <- CS_data %>% mutate(CS_rate_100 = CS_rate*100)
head(CS_data)
ggplot(CS_data, aes(x=CS_rate_100)) +
  geom_histogram(binwidth=2, color="white")
This does not look like data we can really draw a line through. While there's a lot of data concentrated to the left of the graph, a good amount is far away from that as well.
ggplot(CS_data, aes(x=GDP_2006, y=CS_rate_100)) + geom_point()
We want to see if applying the natural log to both variables will make the relationship look linear enough to draw a linear regression line through it.
CS_data <- CS_data %>% mutate(log_CS=log(CS_rate_100),
                              log_GDP=log(GDP_2006))
names(CS_data)
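Why would taking logs help? Here's a minimal sketch with made-up numbers (the toy data frame and its power-law shape are assumptions for illustration, not the Cesarean data). If y grows like a power of x, the raw plot is curved, but log(y) against log(x) is a straight line whose slope is the exponent.

```r
# Toy data (made up for illustration): y follows a power law, y = 2 * x^0.5
toy <- data.frame(x = c(1, 10, 100, 1000, 10000))
toy$y <- 2 * toy$x^0.5

# On the log-log scale the curve becomes a line:
# log(y) = log(2) + 0.5 * log(x)
toy$log_x <- log(toy$x)
toy$log_y <- log(toy$y)

# Fit a straight line to the logged values
toy_fit <- lm(log_y ~ log_x, data = toy)

# The intercept recovers log(2) and the slope recovers the exponent 0.5
coef(toy_fit)
```

Real data won't follow a power law exactly, so the logged points will only look roughly linear at best.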
Here's the plot. It sorta looks linear.
ggplot(CS_data, aes(x=log_GDP, y=log_CS)) +
  geom_point()
But it looks more curved. Check out how we're using geom_smooth().
ggplot(CS_data, aes(x=log_GDP, y=log_CS)) + geom_point() + geom_smooth()
Something interesting to us might be to color our data points by income group. It may "unlock" some more patterns for us to explore. It seems that the high income data are not following the same upward trend as the low and middle income data points.
ggplot(CS_data, aes(x=log_GDP, y=log_CS)) +
  geom_point(aes(color=Income_Group))
We want to remove pretty much all the high income data points from our dataset to see if we can generate a better linear model. We use filter() to do this. The syntax is a little different from what you've seen so far. Let's break it down.
This line, filter(Income_Group %in% c("Low Income", "Lower middle income", "Upper middle income")), means we only want rows that have either "Low Income", "Lower middle income", or "Upper middle income" associated with them.
The c() function makes a "vector" (like what you've seen in calculus), but with words. If you need another metaphor, think of the vector as a bucket holding those three income values: we want only the rows whose Income_Group value is in the bucket.
CS_data_sub <- CS_data %>%
  filter(Income_Group %in% c("Low Income", "Lower middle income", "Upper middle income"))
head(CS_data)
dim(CS_data)
dim(CS_data_sub)
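If the bucket idea still feels abstract, here's a small sketch on a made-up data frame (the toy data, the group column, and the keep vector are all hypothetical, not part of the Cesarean dataset). It shows what %in% hands to filter(): one TRUE or FALSE per row.

```r
library(dplyr)

# Hypothetical toy data, not the Cesarean dataset
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  group   = c("Low income", "High income: OECD", "Upper middle income", "low income")
)

# The "bucket" of values we want to keep
keep <- c("Low income", "Upper middle income")

# %in% returns one TRUE/FALSE per row. Matching is exact and case-sensitive,
# so "low income" (row D) does not match "Low income".
toy$group %in% keep

# filter() keeps only the rows where that test is TRUE
toy_sub <- toy %>% filter(group %in% keep)

# Compare the number of rows before and after
dim(toy)
dim(toy_sub)
```

Because the match is exact, it's worth checking levels(CS_data$Income_Group) and comparing the dim() output before and after filtering to confirm the rows you expected were kept.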
We can draw a straight line through these data, but at the end of the day, the line still doesn't fit the pattern that well (if there is one).
ggplot(CS_data_sub, aes(x=log_GDP, y=log_CS)) + geom_point() + geom_smooth(method="lm")
We make a linear model using lm().
# lm(y_variable ~ x_variable, data=whatever_your_dataframe_is)
mod_1 <- lm(log_CS~log_GDP, data=CS_data_sub)
The tidy() function makes the output easier for us to read.
tidy(mod_1)
Interpretation of the intercept: If log GDP (the x-variable) were 0, then the model predicts the log CS rate (the y-variable) would be -2.7857853.
Interpretation of the slope: If we increase log GDP (the x-variable) by one unit, then the model predicts the log CS rate (the y-variable) would increase by 0.6772512 units.
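To make those two interpretations concrete, here's a rough sketch that plugs the coefficients quoted above into the line by hand. The input value (a log GDP of 7) is a hypothetical number picked only for illustration, and the variable names are made up for this example.

```r
# Coefficients copied from the tidy(mod_1) output quoted in the text
intercept <- -2.7857853
slope     <- 0.6772512

# Hypothetical input: a country whose log GDP is 7 (chosen only for illustration)
log_gdp <- 7

# Prediction on the log scale: intercept + slope * x
pred_log_cs <- intercept + slope * log_gdp

# Undo the log to get back to the CS_rate_100 scale
pred_cs <- exp(pred_log_cs)

# Raising log GDP by one unit adds `slope` to log CS,
# which multiplies the CS rate itself by exp(slope)
ratio <- exp(intercept + slope * (log_gdp + 1)) / pred_cs

pred_log_cs
pred_cs
ratio
```

The takeaway is that on the log scale the slope is an additive change, while on the original scale it acts as a multiplier.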
And glance() gives us $R^2$.
glance(mod_1)
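If you're curious where $R^2$ comes from, here's a minimal sketch on made-up data (the toy data frame is an assumption for illustration). It computes 1 minus the ratio of leftover variation to total variation by hand and compares it with the value base R's summary() reports, which is the same quantity glance() shows in its r.squared column.

```r
# Toy data, made up for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(1.1, 1.9, 3.2, 3.8, 5.3, 5.9))
toy_fit <- lm(y ~ x, data = toy)

# Variation left over after fitting the line
ss_resid <- sum(residuals(toy_fit)^2)

# Total variation of y around its mean
ss_total <- sum((toy$y - mean(toy$y))^2)

# R^2 is the share of the total variation that the line explains
r_squared <- 1 - ss_resid / ss_total

r_squared
summary(toy_fit)$r.squared
```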
A better (I think) way to draw our regression line on our plot is by using geom_abline().
ggplot(CS_data_sub, aes(x=log_GDP, y=log_CS)) +
  geom_abline(slope=0.6772512, intercept=-2.7857853) +
  geom_point()
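Typing the coefficients by hand works, but it's easy to make a typo. One possible alternative, sketched here on made-up data (the toy data frame and the names toy_fit and b are hypothetical), is to pull the numbers straight out of the model with coef() and pass them to geom_abline().

```r
library(ggplot2)

# Toy data, made up for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(1.1, 1.9, 3.2, 3.8, 5.3, 5.9))
toy_fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" and the slope, named after x
b <- coef(toy_fit)

# Draw the fitted line from the stored coefficients, then the points on top
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_abline(slope = b[["x"]], intercept = b[["(Intercept)"]]) +
  geom_point()
p
```

For our model, the same idea would use coef(mod_1), where the slope is named log_GDP.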