This is due on the 7th at 5pm. Your proposals will be looked over. If you're not told otherwise, you're good.
Heavily focused on Part III, but the subject matter is cumulative
Review sessions are "valuable"
Same as before. 1 page, same guidelines. You'll have an equation sheet... but you'll have to recognize these things. I apologize?
There is more code than is necessary for the exam within this notebook... pay attention only to the following.
We use `pnorm` and `qnorm` to calculate probabilities and quantiles for a normal distribution. You also need `pt`, `qt`, and `pchisq`.
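Here's a quick sketch of those calls with arbitrary numbers, just to show the signatures:

pnorm(1.96)                           # P(Z <= 1.96) for a standard normal, ~0.975
qnorm(0.975)                          # the quantile with 97.5% below it, ~1.96
pt(2.05, df=26)                       # probability for a t distribution with 26 df
qt(0.975, df=26)                      # t critical value, ~2.06
pchisq(3.84, df=1, lower.tail=FALSE)  # upper-tail chi-squared probability, ~0.05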
`tidy`
`glance`
`augment`
Don't augment the dataframe. Augment the linear model.
We use `lm` to make linear models.
You need `aov` for your F-test. (See ANOVA below.)
You need `predict` from the world of linear regression. (See lab.)
`ggplot2`
`dplyr`
ANOVA is nothing totally different from what you're used to. What this means for you is that we're doing the F-test. Like the other tests, the name in front of the test is the name of a distribution.
We will be looking at data that look like this.
groups <- c("group_1", "group_2", "group_3", "group_k")
values <- lapply(1:4, function(x) runif(15, 0, 100))
values <- data.frame(do.call(cbind, values))
names(values) <- groups
values
They have means.
values %>% summarize_all(mean)
means <- values %>% summarize_all(mean)
# Variance of the group means themselves (the "between groups" flavor)
var(unlist(means[1,]))
They have variances.
values %>% summarize_all(var)
You don't need to know this, but the F distribution is the ratio of two $\chi^2$ distributions (each scaled by its degrees of freedom). Thus it's a positive distribution and (this you need to know) it has two degrees-of-freedom parameters.
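For example, `pf` and `qf` take both degrees of freedom (the numbers here are arbitrary, just to show the shape of the calls):

pf(2.5, df1=3, df2=56, lower.tail=FALSE)  # upper-tail probability for an F distribution
qf(c(0.025, 0.975), df1=3, df2=56)        # F quantiles; always positive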
We can test the following hypotheses.
$H_o: \, \mu_k$ is the same for all $k$
$H_1: \, \mu_j \neq \mu_k$ for at least one pair $j \neq k$
Yes, the alternative is awkward. Live with it.
If the means of all the groups are similar, then the between-group variance is small. If individual groups have a lot of spread, then we expect a larger within-group variance.
$F = \frac{\text{variability BETWEEN groups}}{\text{variability WITHIN groups}}$
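To make the formula concrete, here's a minimal by-hand sketch of the F statistic using the `values` data frame from above (it assumes equal group sizes, which holds here):

k <- ncol(values)                  # number of groups
n <- nrow(values)                  # observations per group
group_means <- colMeans(values)
grand_mean <- mean(unlist(values))

# BETWEEN: spread of the group means around the grand mean
ms_between <- n * sum((group_means - grand_mean)^2) / (k - 1)

# WITHIN: pooled spread of observations around their own group mean
ms_within <- sum(sweep(as.matrix(values), 2, group_means)^2) / (n * k - k)

ms_between / ms_within             # the F statistic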
See bCourses Ch. 24 on the last few pages. These are extensively detailed. In short,
library(reshape2)
library(broom)

# Reshape from wide to long: one column of group labels, one of values
melt_values <- melt(values)
# melt_values

model <- lm(value ~ variable, data=melt_values)
model

# The pattern is aov(lm(y ~ x))
anova_fit <- aov(model)
tidy(anova_fit)
We can get the degrees of freedom with $k-1$ (between groups) and $N-k$ (within groups), where $k$ is the number of groups and $N$ is the total number of observations.
# 0.8444 is the F statistic read off the tidy() output above
pf(0.8444, df1=4-1, df2=nrow(melt_values)-4, lower.tail=F)
Material overload. Here's Tukey's HSD. We're going to do pairwise tests between each pair of groups to see whether the means are the same or not. Notice these are adjusted/corrected for "multiple testing".
# TukeyHSD(aov(lm(y~x)))
many_tests <- TukeyHSD(anova_fit, conf.level=0.95) %>% tidy()
many_tests
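Why do the corrections matter? With 4 groups there are 6 pairwise tests, and the false positives pile up. A rough sketch (this assumes the tests are independent, which is a simplification):

choose(4, 2)        # 6 pairwise comparisons among 4 groups
1 - (1 - 0.05)^6    # chance of at least one false positive, ~0.26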
rapper_df
Before you make a linear model, you're gonna wanna check if the data look somewhat linear.
library(ggplot2)
ggplot(rapper_df, aes(x=age, y=net_worth)) + geom_point()
Honestly, these data don't look so linear. We're going to run a model through them for educational purposes though.
We can fit a linear model of age and net worth using `lm`.
library(dplyr)
net_worth <- rapper_df %>% pull(net_worth)
age <- rapper_df %>% pull(age)
model <- lm(net_worth ~ age)
Check out what `glance` does.
library(broom)
glance(model)
tidy(model)
Why all the extra columns (std.error, statistic, p.value)? Because of formal hypothesis testing. We'll get to this soon.
We have $y=mx+b$ with $m=5.9601$ and $b=-146.4278$.
According to our data, we see that as an artist gets older by a year, the artist's net worth increases by 5.96 million dollars.
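A quick sanity check on that interpretation, pulling the coefficients out with `coef`:

b <- coef(model)[1]                   # intercept, ~ -146.43
m <- coef(model)[2]                   # slope, ~ 5.96
unname((m * 41 + b) - (m * 40 + b))   # one more year of age adds ~5.96 million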
We'll use `augment` to get fitted values and residuals. Pay attention to how I renamed the columns so it's not super confusing.
augmented_model <- augment(model)
augmented_model <- augmented_model %>%
select(net_worth, age, .fitted, .resid) %>%
rename(fitted=.fitted,
residual=.resid)
head(augmented_model)
ggplot(augmented_model, aes(y=net_worth, x=age)) +
geom_segment(aes(xend=age, yend=fitted), color="darkgrey") +
geom_smooth(method="lm", color="black") +
geom_point() +
ggtitle("Peep the residuals")
ggplot(augmented_model, aes(sample=residual)) +
geom_qq_line(col="black") +
geom_qq() +
ggtitle("QQ plot of residuals")
ggplot(augmented_model, aes(y=residual, x=fitted)) +
geom_point(col="midnightblue") +
geom_hline(aes(yintercept = 0)) +
ggtitle("Shows patterns between the fitted line and predicted values")
library(tidyr)
reshape <- augmented_model %>% dplyr::select(residual, net_worth) %>%
gather(key="type", value="value", net_worth, residual)
ggplot(reshape, aes(y=value)) +
geom_boxplot(aes(fill = type)) +
ggtitle("Look at the variation in residuals and in the response") +
scale_fill_manual(values = c("darkgrey", "gold"))
These hypotheses are different from the ones in the book.
These are the hypotheses for $\beta_1$.
$H_o: \, \beta_1 = 0$
$H_1: \, \beta_1 \neq 0$
Why would we do this?
tidy(model)
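Connecting this back to `pt` from the checklist at the top: the `statistic` column is just estimate divided by std.error, and the p-value comes from a t distribution with $n - 2$ degrees of freedom. A sketch with the numbers from above:

t_stat <- 5.9601 / 3.303                                # estimate / std.error
2 * pt(t_stat, df=nrow(rapper_df)-2, lower.tail=FALSE)  # two-sided p-value, ~0.08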
We will test the same hypotheses. The format for a confidence interval remains the same: it is the interval created by the estimate plus/minus the margin of error.
This is "by hand". </font>
# Critical t value with n - 2 degrees of freedom
t_star <- qt(p=0.975, df=nrow(rapper_df)-2)
t_star
So, estimate $\pm$ t_star $\times$ std.error. That is, $5.9601 \pm (2.055)(3.303)$.
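And the same interval computed in R, using the estimate and std.error from the `tidy(model)` output above:

estimate <- 5.9601
std_error <- 3.303
c(estimate - t_star * std_error, estimate + t_star * std_error)  # roughly (-0.83, 12.75)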
The following is "using R". Why am I checking the ranges?
range(rapper_df$age)
new_ages <- c(20, 30, 40, 50)
Recall the model...
model
predict(model, newdata=data.frame(age=new_ages), interval = "confidence")
Say we want to predict the response for a new observation. Unlike the CI, which covers the mean response, a prediction interval must cover an individual new value, so it also accounts for the scatter of points around the line. The PI is wider.
predict(model, newdata=data.frame(age=new_ages), interval = "prediction")
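To see "wider" directly, compare the two intervals at the same ages (a quick check, not something you'd be required to run):

conf_int <- predict(model, newdata=data.frame(age=new_ages), interval="confidence")
pred_int <- predict(model, newdata=data.frame(age=new_ages), interval="prediction")
pred_int[, "upr"] - pred_int[, "lwr"]   # prediction interval widths
conf_int[, "upr"] - conf_int[, "lwr"]   # confidence interval widths (narrower)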
# * #
# COLUMN BY COLUMN #
# * #
artist_names = c("Nicki Minaj",
"Jay Z",
"Eminem",
"Kendrick Lamar",
"Logic",
"E-40",
"Nas",
"Jadakiss",
"Chance the Rapper",
"Childish Gambino",
"Kanye West",
"Cardi B"
)
real_names = c("Onika Maraj",
"Shawn Carter",
"Marshall Mathers",
"Kendrick Duckworth",
"Robert Hall",
"Earl Stevens",
"Nasir Jones",
"Jason Phillips",
"Chancelor Bennett",
"Donald Glover",
"Kanye West",
"Belcalis Almanzar"
)
age = c(35, 48, 45, 31, 28, 50, 45, 43, 25, 34, 41, 25)
state = c("New York",
"New York",
"Missouri",
"California",
"Maryland",
"California",
"New York",
"New York",
"Illinois",
"California",
"Georgia",
"New York"
)
worth = c(75, 900, 190, 45, 10, 10, 50, 6, 33, 12, 160, 8)
start = c(2004, 1986, 1988, 2003, 2009, 1986, 1991, 1991, 2011, 2002, 1996, 2015)
rapper_df = data.frame(artist_name=artist_names, legal_name=real_names, age=age, origin=state, net_worth=worth, start_year=start)
# * #
# ROW BY ROW #
# * #
adds = rbind(
c("G-Eazy", "Gerald Gillum", 29, "California", 9, 2008),
c("2Pac", "Tupac Shakur", 49, "California", 40, 1987),
c("YG", "Keenon Jackson", 28, "California", 3, 2009),
c("ScHoolboy Q", "Quincy Hanley", 31, "California", 4, 2008),
c("Lupe Fiasco", "Wasalu Jaco", 36, "Chicago", 14, 2000),
c("Drake", "Aubrey Graham", 31, "International", 100, 2001),
c("Joey BadA$$", "Jo-Vaughn Scott", 23, "New York", 4, 2010),
c("Snoop Dogg", "Calvin Broadus", 46, "California", 135, 1991),
c("Iamsu!", "Sudan Williams", 28, "California", 1, 2007),
c("J. Cole", "Jermaine Cole", 33, "New York", 15, 2007),
c("Gucci Mane", "Radric Davis", 38, "Georgia", 12, 2001),
c("Rich Brian", "Brian Imanuel", 19, "International", 0.3, 2015),
c("Too $hort", "Todd Shaw", 52, "California", 15, 1984),
c("Mac Dre", "Andre Hicks", 48, "California", 1.5, 1984),
c("Missy Elliott", "Melissa Elliott", 47, "Virginia", 50, 1989),
c("Notorious B.I.G.", "Christopher Wallace", 46, "New York", 160, 1992)
)
# Keep strings as characters so the as.numeric() coercions below work correctly
adds_df = data.frame(adds, stringsAsFactors=FALSE)
names(adds_df) = names(rapper_df)
rapper_df = rbind(rapper_df, adds_df)
rapper_df$age = as.numeric(rapper_df$age)
rapper_df$net_worth = as.numeric(rapper_df$net_worth)
rapper_df$start_year = as.numeric(rapper_df$start_year)
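A quick sanity check that the coercions worked:

str(rapper_df)   # age, net_worth, and start_year should now be numeric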