This is due on the 7th at 5pm. Your proposals will be looked over. If you're not told otherwise, you're good.
Heavily focused on Part III, but the subject matter is cumulative
Review sessions are "valuable"
Same as before. 1 page, same guidelines. You'll have an equation sheet... but you'll have to recognize these things. I apologize?
There is more code than is necessary for the exam within this notebook... pay attention only to the following.
We use `pnorm` and `qnorm` to calculate probabilities and quantiles for a normal distribution. You also need `pt`, `qt`, and `pchisq`.
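Here's a quick sketch of those calls with arbitrary numbers, just to show the signatures:

pnorm(1.96)                           # P(Z <= 1.96) for a standard normal, ~0.975
qnorm(0.975)                          # the quantile with 97.5% below it, ~1.96
pt(2.05, df=26)                       # probability for a t distribution with 26 df
qt(0.975, df=26)                      # t critical value, ~2.06
pchisq(3.84, df=1, lower.tail=FALSE)  # upper-tail chi-squared probability, ~0.05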
`tidy`
`glance`
`augment`
Don't augment the dataframe. Augment the linear model.
We use `lm` to make linear models.
You need `aov` for your F-test. (See ANOVA below.)
You need `predict` from the world of linear regression. (See lab.)
`ggplot2`
`dplyr`
ANOVA is nothing totally different from what you're used to. What this means for you is that we're doing the F-test. Like the other tests, the name in front of the test is the name of a distribution.
We will be looking at data that look like this.
groups <- c("group_1", "group_2", "group_3", "group_k")
values <- lapply(1:4, function(x) runif(15, 0, 100))
values <- data.frame(do.call(cbind, values))
names(values) <- groups
values
They have means.
values %>% summarize_all(mean)
means <- values %>% summarize_all(mean)
# Variance of the group means themselves (the "between groups" flavor)
var(unlist(means[1,]))
They have variances.
values %>% summarize_all(var)
You don't need to know this, but the F distribution is the ratio of two $\chi^2$ distributions (each scaled by its degrees of freedom). Thus it's a positive distribution and (this you need to know) it has two degrees-of-freedom parameters.
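For example, `pf` and `qf` take both degrees of freedom (the numbers here are arbitrary, just to show the shape of the calls):

pf(2.5, df1=3, df2=56, lower.tail=FALSE)  # upper-tail probability for an F distribution
qf(c(0.025, 0.975), df1=3, df2=56)        # F quantiles; always positive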
We can test the following hypotheses.
$H_o: \, \mu_k$ is the same for all $k$
$H_1: \, \mu_j \neq \mu_k$ for at least one pair $j \neq k$
Yes, the alternative is awkward. Live with it.
If the means of all the groups are similar, then the between-group variance is small. If individual groups have a lot of spread, then we expect a larger within-group variance.
$F = \frac{\text{variability BETWEEN groups}}{\text{variability WITHIN groups}}$
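To make the formula concrete, here's a minimal by-hand sketch of the F statistic using the `values` data frame from above (it assumes equal group sizes, which holds here):

k <- ncol(values)                  # number of groups
n <- nrow(values)                  # observations per group
group_means <- colMeans(values)
grand_mean <- mean(unlist(values))

# BETWEEN: spread of the group means around the grand mean
ms_between <- n * sum((group_means - grand_mean)^2) / (k - 1)

# WITHIN: pooled spread of observations around their own group mean
ms_within <- sum(sweep(as.matrix(values), 2, group_means)^2) / (n * k - k)

ms_between / ms_within             # the F statistic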
See bCourses Ch. 24 on the last few pages. These are extensively detailed. In short,
library(reshape2)
library(broom)

# Reshape from wide to long: one column of group labels, one of values
melt_values <- melt(values)
# melt_values

model <- lm(value ~ variable, data=melt_values)
model

# The pattern is aov(lm(y ~ x))
anova_fit <- aov(model)
tidy(anova_fit)
We can get the degrees of freedom with $k-1$ (between groups) and $N-k$ (within groups), where $k$ is the number of groups and $N$ is the total number of observations.
# 0.8444 is the F statistic read off the tidy() output above
pf(0.8444, df1=4-1, df2=nrow(melt_values)-4, lower.tail=F)
Material overload. Here's Tukey's HSD. We're going to do pairwise tests between each pair of groups to see whether the means are the same or not. Notice these are adjusted/corrected for "multiple testing".
# TukeyHSD(aov(lm(y~x)))
many_tests <- TukeyHSD(anova_fit, conf.level=0.95) %>% tidy()
many_tests
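Why do the corrections matter? With 4 groups there are 6 pairwise tests, and the false positives pile up. A rough sketch (this assumes the tests are independent, which is a simplification):

choose(4, 2)        # 6 pairwise comparisons among 4 groups
1 - (1 - 0.05)^6    # chance of at least one false positive, ~0.26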
rapper_df
Before you make a linear model, you're gonna wanna check if the data look somewhat linear.
library(ggplot2)
ggplot(rapper_df, aes(x=age, y=net_worth)) + geom_point()
Honestly, these data don't look so linear. We're going to run a model through them for educational purposes though.
We can fit a linear model of age and net worth using `lm`.
library(dplyr)
net_worth <- rapper_df %>% pull(net_worth)
age <- rapper_df %>% pull(age)
model <- lm(net_worth ~ age)
Check out what `glance` does.
library(broom)
glance(model)
tidy(model)
Why all the extra columns (std.error, statistic, p.value)? Because of formal hypothesis testing. We'll get to this soon.
We have $y=mx+b$ with $m=5.9601$ and $b=-146.4278$.
According to our data, we see that as an artist gets older by a year, the artist's net worth increases by 5.96 million dollars.
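A quick sanity check on that interpretation, pulling the coefficients out with `coef`:

b <- coef(model)[1]                   # intercept, ~ -146.43
m <- coef(model)[2]                   # slope, ~ 5.96
unname((m * 41 + b) - (m * 40 + b))   # one more year of age adds ~5.96 million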
We'll use `augment` to get fitted values and residuals. Pay attention to how I renamed the columns so it's not super confusing.
augmented_model <- augment(model)
augmented_model <- augmented_model %>%
select(net_worth, age, .fitted, .resid) %>%
rename(fitted=.fitted,
residual=.resid)
head(augmented_model)
ggplot(augmented_model, aes(y=net_worth, x=age)) +
geom_segment(aes(xend=age, yend=fitted), color="darkgrey") +
geom_smooth(method="lm", color="black") +
geom_point() +
ggtitle("Peep the residuals")
ggplot(augmented_model, aes(sample=residual)) +
geom_qq_line(col="black") +
geom_qq() +
ggtitle("QQ plot of residuals")
ggplot(augmented_model, aes(y=residual, x=fitted)) +
geom_point(col="midnightblue") +
geom_hline(aes(yintercept = 0)) +
ggtitle("Shows patterns between the fitted line and predicted values")
library(tidyr)
reshape <- augmented_model %>% dplyr::select(residual, net_worth) %>%
gather(key="type", value="value", net_worth, residual)
ggplot(reshape, aes(y=value)) +
geom_boxplot(aes(fill = type)) +
ggtitle("Look at the variation in residuals and in the response") +
scale_fill_manual(values = c("darkgrey", "gold"))
These hypotheses are different from the ones in the book.
These are the hypotheses for $\beta_1$.
$H_o: \, \beta_1 = 0$
$H_1: \, \beta_1 \neq 0$
Why would we do this?
tidy(model)
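Connecting this back to `pt` from the checklist at the top: the `statistic` column is just estimate divided by std.error, and the p-value comes from a t distribution with $n - 2$ degrees of freedom. A sketch with the numbers from above:

t_stat <- 5.9601 / 3.303                                # estimate / std.error
2 * pt(t_stat, df=nrow(rapper_df)-2, lower.tail=FALSE)  # two-sided p-value, ~0.08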
We will test the same hypotheses. The format for a confidence interval remains the same: it is the interval created by the estimate plus/minus the margin of error.
This is "by hand". </font>
# Critical t value with n - 2 degrees of freedom
t_star <- qt(p=0.975, df=nrow(rapper_df)-2)
t_star
So, estimate $\pm$ t_star $\times$ std.error. That is, $5.9601 \pm (2.055)(3.303)$.
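And the same interval computed in R, using the estimate and std.error from the `tidy(model)` output above:

estimate <- 5.9601
std_error <- 3.303
c(estimate - t_star * std_error, estimate + t_star * std_error)  # roughly (-0.83, 12.75)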
The following is "using R". Why am I checking the ranges?
range(rapper_df$age)
new_ages <- c(20, 30, 40, 50)
Recall the model...
model
predict(model, newdata=data.frame(age=new_ages), interval = "confidence")
Say we want to predict the response for a new observation. Unlike the CI, which covers the mean response, a prediction interval must cover an individual new value, so it also accounts for the scatter of points around the line. The PI is wider.
predict(model, newdata=data.frame(age=new_ages), interval = "prediction")
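To see "wider" directly, compare the two intervals at the same ages (a quick check, not something you'd be required to run):

conf_int <- predict(model, newdata=data.frame(age=new_ages), interval="confidence")
pred_int <- predict(model, newdata=data.frame(age=new_ages), interval="prediction")
pred_int[, "upr"] - pred_int[, "lwr"]   # prediction interval widths
conf_int[, "upr"] - conf_int[, "lwr"]   # confidence interval widths (narrower)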
# * #
# COLUMN BY COLUMN #
# * #
artist_names = c("Nicki Minaj",
"Jay Z",
"Eminem",
"Kendrick Lamar",
"Logic",
"E-40",
"Nas",
"Jadakiss",
"Chance the Rapper",
"Childish Gambino",
"Kanye West",
"Cardi B"
)
real_names = c("Onika Maraj",
"Shawn Carter",
"Marshall Mathers",
"Kendrick Duckworth",
"Robert Hall",
"Earl Stevens",
"Nasir Jones",
"Jason Phillips",
"Chancelor Bennett",
"Donald Glover",
"Kanye West",
"Belcalis Almanzar"
)
age = c(35, 48, 45, 31, 28, 50, 45, 43, 25, 34, 41, 25)
state = c("New York",
"New York",
"Missouri",
"California",
"Maryland",
"California",
"New York",
"New York",
"Illinois",
"California",
"Georgia",
"New York"
)
worth = c(75, 900, 190, 45, 10, 10, 50, 6, 33, 12, 160, 8)
start = c(2004, 1986, 1988, 2003, 2009, 1986, 1991, 1991, 2011, 2002, 1996, 2015)
rapper_df = data.frame(artist_name=artist_names, legal_name=real_names, age=age, origin=state, net_worth=worth, start_year=start)
# * #
# ROW BY ROW #
# * #
adds = rbind(
c("G-Eazy", "Gerald Gillum", 29, "California", 9, 2008),
c("2Pac", "Tupac Shakur", 49, "California", 40, 1987),
c("YG", "Keenon Jackson", 28, "California", 3, 2009),
c("ScHoolboy Q", "Quincy Hanley", 31, "California", 4, 2008),
c("Lupe Fiasco", "Wasalu Jaco", 36, "Chicago", 14, 2000),
c("Drake", "Aubrey Graham", 31, "International", 100, 2001),
c("Joey BadA$$", "Jo-Vaughn Scott", 23, "New York", 4, 2010),
c("Snoop Dogg", "Calvin Broadus", 46, "California", 135, 1991),
c("Iamsu!", "Sudan Williams", 28, "California", 1, 2007),
c("J. Cole", "Jermaine Cole", 33, "New York", 15, 2007),
c("Gucci Mane", "Radric Davis", 38, "Georgia", 12, 2001),
c("Rich Brian", "Brian Imanuel", 19, "International", 0.3, 2015),
c("Too $hort", "Todd Shaw", 52, "California", 15, 1984),
c("Mac Dre", "Andre Hicks", 48, "California", 1.5, 1984),
c("Missy Elliott", "Melissa Elliott", 47, "Virginia", 50, 1989),
c("Notorious B.I.G.", "Christopher Wallace", 46, "New York", 160, 1992)
)
# Keep strings as characters so the as.numeric() coercions below work correctly
adds_df = data.frame(adds, stringsAsFactors=FALSE)
names(adds_df) = names(rapper_df)
rapper_df = rbind(rapper_df, adds_df)
rapper_df$age = as.numeric(rapper_df$age)
rapper_df$net_worth = as.numeric(rapper_df$net_worth)
rapper_df$start_year = as.numeric(rapper_df$start_year)
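A quick sanity check that the coercions worked:

str(rapper_df)   # age, net_worth, and start_year should now be numeric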