Last lab!

Office Hours

  1. Tomorrow, I'll hold office hours 9:00-10:30 AM. Come with questions.
    1. I have discussion from 2:10-3:00 in Wheeler 124 as well.
    2. Next week, Friday 9:00-10:30 AM as well. No Monday.


Infographic

This is due on the 7th at 5 PM. Your proposals will be looked over; if you're not told otherwise, you're good.

Review Sessions

  1. Monday, Wednesday, Friday at 8-9 AM in Li Ka Shing.

  2. Practice questions are released under the Final Exam Review folder on bCourses.

Content

Heavily focused on Part III, but the subject matter is cumulative.
Review sessions are "valuable"

"Crib sheet"

Same as before: one page, same guidelines. You'll have an equation sheet... but you'll have to recognize what those equations are for. I apologize?

Code Necessities

There is more code in this notebook than is necessary for the exam... pay attention only to the following.

Distribution Functions

We use pnorm and qnorm to calculate probabilities and quantiles for a normal distribution. You also need pt, qt, and pchisq.
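A minimal sketch of each call (the cutoffs and df here are illustrative, not from the lab):

pnorm(1.96)                              # P(Z <= 1.96) for a standard normal, ~0.975
qnorm(0.975)                             # z value with 97.5% of the mass below it, ~1.96
pt(2.05, df=26, lower.tail=FALSE)        # upper-tail probability for a t statistic
qt(0.975, df=26)                         # t critical value, e.g. for a 95% CI
pchisq(3.84, df=1, lower.tail=FALSE)     # upper-tail chi-square probability, ~0.05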

  • tidy
  • glance
  • augment (don't augment the dataframe; augment the linear model)

We use lm to make linear models.
You need aov for your F-test. (See ANOVA below.)
You need predict from the world of linear regression. (See lab.)
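Here's a hedged sketch of how the three fit together; x and y are made-up placeholders, not data from this lab.

x <- runif(30)                                   # hypothetical predictor
y <- 2 * x + rnorm(30)                           # hypothetical response
fit <- lm(y ~ x)                                 # fit the linear model
summary(aov(y ~ x))                              # F-test on the same formula
predict(fit, newdata = data.frame(x = 0.5),
        interval = "confidence")                 # CI for the mean response at x = 0.5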

Interpretation

  • ggplot2
  • dplyr

New this week: Analysis of Variance.

ANOVA is nothing totally different from what you're used to. What this means for you is that we're doing the F-test. Like the other tests, the name in front of the test is the name of a distribution.

We will be looking at data that look like this.

In [56]:
groups <- c("group_1", "group_2", "group_3", "group_k")

# four groups of 15 draws from Uniform(0, 100)
values <- lapply(1:4, function(x) runif(15, 0, 100))
values <- data.frame(do.call(cbind, values))

names(values) <- groups
values
  group_1   group_2    group_3   group_k
 1.078974 45.853776 67.3876087 89.060001
43.485962 25.347964 84.1581032 50.267509
96.448578  2.655742 42.5167292 67.746999
11.493893 54.023102 21.3142793 49.930421
90.283501 79.027486 65.2763417 20.322345
64.433351 28.327148 93.0615224  5.358334
71.221161 58.219872 46.5050540 83.879548
16.947358 63.598982 19.0725306 90.537611
41.576571 55.409839 30.0676575 14.745989
36.899950 48.258574 21.3380388 95.837700
42.777855 93.853638 91.5993112 99.669502
 1.743857 84.365987 82.1951139 20.316152
 5.472901 25.160614 11.8597290  8.757535
18.847923 21.078640  0.5562568 90.611898
56.198888 70.951467 22.8036103 77.329258

They have means.

In [61]:
values %>% summarize_all(mean)
 group_1  group_2  group_3  group_k
39.92738 50.40886 46.64746 57.62472
In [70]:
means <- values %>% summarize_all(mean)
var(unlist(means[1,]))
54.5778004508478

They have variances.

In [63]:
values %>% summarize_all(var)
 group_1  group_2  group_3  group_k
967.3459 676.6959 990.0434  1244.48

F Distribution

You don't need to know this, but the F distribution is the ratio of two $\chi^2$ distributions, each divided by its degrees of freedom. Thus it's a positive distribution and (this you do need to know) it has two degrees-of-freedom parameters.
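For instance, a quick sketch with qf and pf; the df values mirror the ANOVA below, and the cutoff is illustrative:

qf(0.95, df1=3, df2=56)                    # F critical value at the 5% level, ~2.77
pf(2.77, df1=3, df2=56, lower.tail=FALSE)  # upper-tail probability, ~0.05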

F Test (For equal means among k groups)

We can test the following hypotheses.

$H_o: \, \mu_k$ is the same for all $k$
$H_1: \, \mu_j \neq \mu_k$ for at least one pair $j \neq k$

Yes, the alternative is awkward. Live with it.

F Statistic

If the means of all the groups are similar, then the variability between groups is small. If one or more groups have a large spread, then we expect larger variability within groups.

$F= \frac {\text{variability BETWEEN groups}} {\text {variability WITHIN groups}}$
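Here is a by-hand sketch of that ratio for the values table above; the helper names are just for illustration, and with equal group sizes it reproduces the mean squares that aov reports below.

group_means <- sapply(values, mean)
group_vars  <- sapply(values, var)
n <- nrow(values)                  # observations per group (15)
k <- ncol(values)                  # number of groups (4)
between <- n * var(group_means)    # mean square BETWEEN groups, df = k - 1
within  <- mean(group_vars)        # mean square WITHIN groups, df = N - k
between / within                   # ~0.844, matching the F statistic below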

Assumptions

See bCourses Ch. 24 on the last few pages. These are extensively detailed. In short,

  1. SRS
  2. Normal would be nice
  3. Same standard deviation between pops


In [131]:
library(reshape2)
# reshape from wide (one column per group) to long (group label + value)
melt_values <- melt(values)

model <- lm(value ~ variable, data=melt_values)
model
No id variables; using all as measure variables
Call:
lm(formula = value ~ variable, data = melt_values)

Coefficients:
    (Intercept)  variablegroup_2  variablegroup_3  variablegroup_k  
          39.93            10.48             6.72            17.70  
In [82]:
library(broom)
# aov(lm(y~x))
anova <- aov(model)
tidy(anova)
term      df sumsq     meansq   statistic p.value
variable   3  2456.001 818.6670 0.8442988 0.475443
Residuals 56 54299.913 969.6413        NA      NA

We can get the degrees of freedom as k - 1 (between groups) and N - k (within groups).

In [134]:
pf(0.8444, df1=4-1, df2=nrow(melt_values)-4, lower.tail=F)
0.475390077217608

Material overload. Here's Tukey's HSD: pairwise tests between each pair of groups to see whether their means differ. Notice the p-values are adjusted/corrected for "multiple testing".

In [86]:
# TukeyHSD(aov(lm(y~x)))
# note: conf.level is usually 0.95; it was left at 0.05 when this was run
many_tests <- TukeyHSD(anova, conf.level=0.05) %>% tidy()
many_tests
term     comparison      estimate  conf.low   conf.high adj.p.value
variable group_2-group_1 10.481474  4.3919638 16.570984  0.7933050
variable group_3-group_1  6.720078  0.6305675 12.809588  0.9343905
variable group_k-group_1 17.697339 11.6078286 23.786849  0.4115235
variable group_3-group_2 -3.761396 -9.8509065  2.328114  0.9873787
variable group_k-group_2  7.215865  1.1263546 13.305375  0.9204055
variable group_k-group_3 10.977261  4.8877509 17.066771  0.7695074

Now, the lab: the regression test.

Return to the Rappers Dataset

For those of you who watched my midterm 1 review vids, this dataframe should be familiar. For those of you who want to run this at home, run the very last chunk first.

In [4]:
rapper_df
artist_name legal_name age origin net_worth start_year
Nicki Minaj Onika Maraj 35 New York 75.0 2004
Jay Z Shawn Carter 48 New York 900.0 1986
Eminem Marshall Mathers 45 Missouri 190.0 1988
Kendrick Lamar Kendrick Duckworth 31 California 45.0 2003
Logic Robert Hall 28 Maryland 10.0 2009
E-40 Earl Stevens 50 California 10.0 1986
Nas Nasir Jones 45 New York 50.0 1991
Jadakiss Jason Phillips 43 New York 6.0 1991
Chance the Rapper Chancelor Bennett 25 Illinois 33.0 2011
Childish Gambino Donald Glover 34 California 12.0 2002
Kanye West Kanye West 41 Georgia 160.0 1996
Cardi B Belcalis Almanzar 25 New York 8.0 2015
G-Eazy Gerald Gillum 29 California 9.0 2008
2Pac Tupac Shakur 49 California 40.0 1987
YG Keenon Jackson 28 California 3.0 2009
ScHoolboy Q Quincy Hanley 31 California 4.0 2008
Lupe Fiasco Wasalu Jaco 36 Chicago 14.0 2000
Drake Aubrey Graham 31 International 100.0 2001
Joey BadA$$ Jo-Vaughn Scott 23 New York 4.0 2010
Snoop Dogg Calvin Broadus 46 California 135.0 1991
Iamsu! Sudan Williams 28 California 1.0 2007
J. Cole Jermaine Cole 33 New York 15.0 2007
Gucci Mane Radric Davis 38 Georgia 12.0 2001
Rich Brian Brian Imanuel 19 International 0.3 2015
Too $hort Todd Shaw 52 California 15.0 1984
Mac Dre Andre Hicks 48 California 1.5 1984
Missy Elliott Melissa Elliott 47 Virginia 50.0 1989
Notorious B.I.G. Christopher Wallace 46 New York 160.0 1992

Scatterplot

Before you make a linear model, you're gonna wanna check whether the data look somewhat linear.

In [16]:
library(ggplot2)
ggplot(rapper_df, aes(x=age, y=net_worth)) + geom_point()

Honestly, these data don't look so linear. We're going to run a model through them anyway, for educational purposes.

Fit the model

We can fit a linear model of age and net worth using `lm`.

In [111]:
library(dplyr)
net_worth <- rapper_df %>% pull(net_worth)
age <- rapper_df %>% pull(age)
model <- lm(net_worth ~ age)

Check out what `glance` does.

In [112]:
library(broom)
glance(model)
r.squared adj.r.squared sigma    statistic p.value    df logLik    AIC      BIC      deviance df.residual
0.1112938 0.07711283    164.3523 3.256014  0.08275907  2 -181.5491 369.0982 373.0948 702303.5  26
In [113]:
library(broom)
tidy(model)
term        estimate    std.error  statistic p.value
(Intercept) -146.427843 125.868666 -1.163338 0.25525806
age            5.960135   3.303033  1.804443 0.08275907

Why do we have p-values in our output?

Because of formal hypothesis testing. We'll get to this soon.

Write the model

We have $y=mx+b$ with $m=5.9601$ and $b=-146.4278$.
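As a sanity check, plugging the first rapper's age (35) into that line reproduces the fitted value that augment reports below:

-146.4278 + 5.9601 * 35    # ~62.18, the fitted net worth at age 35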

Interpret the parameters

According to our data, each additional year of age is associated with an increase of about 5.96 million dollars in net worth, on average.

Calculate associated regression values

We'll use `augment` to get fitted values and residuals. Pay attention to how I renamed the columns so it's not super confusing.

In [117]:
augmented_model <- augment(model)
augmented_model <- augmented_model %>%
                    select(net_worth, age, .fitted, .resid) %>%
                    rename(fitted=.fitted,
                           residual=.resid)
head(augmented_model)
net_worth age fitted residual
75 35 62.17688 12.823118
900 48 139.65864 760.341362
190 45 121.77823 68.221767
45 31 38.33634 6.663658
10 28 20.45594 -10.455937
10 50 151.57891 -141.578908

We can make so many plots with this dataframe

In [118]:
ggplot(augmented_model, aes(y=net_worth, x=age)) +
    geom_segment(aes(xend=age, yend=fitted), color="darkgrey") +
    geom_smooth(method="lm", color="black") +
    geom_point() +
    ggtitle("Peep the residuals")
In [119]:
ggplot(augmented_model, aes(sample=residual)) + 
    geom_qq_line(col="black") +
    geom_qq() +
    ggtitle("QQ plot of residuals")
In [120]:
ggplot(augmented_model, aes(y=residual, x=fitted)) + 
    geom_point(col="midnightblue") +
    geom_hline(aes(yintercept = 0)) +
    ggtitle("Look for patterns between residuals and fitted values")
In [121]:
library(tidyr)

reshape <- augmented_model %>% dplyr::select(residual, net_worth) %>%
              gather(key="type", value="value", net_worth, residual)

ggplot(reshape, aes(y=value)) +
  geom_boxplot(aes(fill = type)) +
  ggtitle("Look at the variation in residuals and in the response") +
  scale_fill_manual(values = c("darkgrey", "gold"))

Assumptions

These assumptions are different from the ones in the book.

  • $x, y$ linear in scatterplot
  • Residuals are normally distributed.
  • Independent observations
  • Standard deviation of the response is the same for all values of $x$

Hypothesis Testing

These are the hypotheses for $\beta_1$.

$H_o: \, \beta_1 = 0$
$H_1: \, \beta_1 \neq 0$

Why would we do this?

In [127]:
tidy(model)
term        estimate    std.error  statistic p.value
(Intercept) -146.427843 125.868666 -1.163338 0.25525806
age            5.960135   3.303033  1.804443 0.08275907

Confidence Intervals

We will test the same hypotheses. The format for a confidence interval remains the same: the estimate plus/minus a critical value times the standard error.

This is "by hand".

In [130]:
t_star <- qt(p=0.975, df=nrow(rapper_df)-2)
t_star
2.05552943864287

So, estimate +/- t_star * std.error. That is, $5.9601 \pm (2.0555)(3.3030)$.
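In code, that interval is the sketch below; confint (base R) computes the same thing directly.

5.960135 + c(-1, 1) * t_star * 3.303033   # ~(-0.83, 12.75)
confint(model, "age", level=0.95)         # same interval, straight from the model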

The following is "using R". Why am I checking the ranges?

In [122]:
range(rapper_df$age)
19 52
In [123]:
new_ages <- c(20, 30, 40, 50)

Recall the model...

In [124]:
model
Call:
lm(formula = net_worth ~ age)

Coefficients:
(Intercept)          age  
    -146.43         5.96  
In [125]:
predict(model, newdata=data.frame(age=new_ages), interval = "confidence")
  fit        lwr        upr
1  -27.22514 -158.70290 104.2526
2   32.37621  -46.92674 111.6792
3   91.97756   24.81412 159.1410
4  151.57891   42.25228 260.9055

Prediction Intervals

Say we want to predict the net worth of a single new artist. A prediction interval covers an individual observation, so it accounts for the scatter of points around the line as well as the uncertainty in the line itself; a confidence interval only covers the mean response. The PI is therefore wider.

In [126]:
predict(model, newdata=data.frame(age=new_ages), interval = "predict")
  fit        lwr       upr
1  -27.22514 -389.7388 335.2885
2   32.37621 -314.6378 379.3902
3   91.97756 -252.4650 436.4201
4  151.57891 -203.5015 506.6593

Take home from this

  1. Read this all over at home. So, take this document itself home.
  2. Be able to interpret output.
  3. Make sure you have written down all your assumptions. At this point, you don't want to be confusing tests... Make a nice table for yourself.
  4. Know the difference between t-star (the critical value) and the t-statistic.

Preface: Load in data

In [3]:
# *                          #
#      COLUMN BY COLUMN      #
#                          * #

names = c("Nicki Minaj",
          "Jay Z",
          "Eminem",
          "Kendrick Lamar",
          "Logic",
          "E-40",
          "Nas",
          "Jadakiss",
          "Chance the Rapper",
          "Childish Gambino",
          "Kanye West",
          "Cardi B"
          )

real_names = c("Onika Maraj",
               "Shawn Carter",
               "Marshall Mathers",
               "Kendrick Duckworth",
               "Robert Hall",
               "Earl Stevens",
               "Nasir Jones",
               "Jason Phillips",
               "Chancelor Bennett",
               "Donald Glover",
               "Kanye West",
               "Belcalis Almanzar"
               )

age = c(35, 48, 45, 31, 28, 50, 45, 43, 25, 34, 41, 25)

state = c("New York",
          "New York",
          "Missouri",
          "California",
          "Maryland",
          "California",
          "New York",
          "New York",
          "Illinois",
          "California",
          "Georgia",
          "New York"
          )

worth = c(75, 900, 190, 45, 10, 10, 50, 6, 33, 12, 160, 8)

start = c(2004, 1986, 1988, 2003, 2009, 1986, 1991, 1991, 2011, 2002, 1996, 2015)

# stringsAsFactors=FALSE keeps the text columns as characters (the default in R >= 4.0)
rapper_df = data.frame(artist_name=names, legal_name=real_names, age=age, origin=state, net_worth=worth, start_year=start, stringsAsFactors=FALSE)



# *                     #
#      ROW BY ROW       #
#                     * #

adds = rbind(
      c("G-Eazy", "Gerald Gillum", 29, "California", 9, 2008),
      c("2Pac", "Tupac Shakur", 49, "California", 40, 1987),
      c("YG", "Keenon Jackson", 28, "California", 3, 2009),
      c("ScHoolboy Q", "Quincy Hanley", 31, "California", 4, 2008),
      c("Lupe Fiasco", "Wasalu Jaco", 36, "Chicago", 14, 2000),
      c("Drake", "Aubrey Graham", 31, "International", 100, 2001),
      c("Joey BadA$$", "Jo-Vaughn Scott", 23, "New York", 4, 2010),
      c("Snoop Dogg", "Calvin Broadus", 46, "California", 135, 1991),
      c("Iamsu!", "Sudan Williams", 28, "California", 1, 2007),
      c("J. Cole", "Jermaine Cole", 33, "New York", 15, 2007),
      c("Gucci Mane", "Radric Davis", 38, "Georgia", 12, 2001),
      c("Rich Brian", "Brian Imanuel", 19, "International", 0.3, 2015),
      c("Too $hort", "Todd Shaw", 52, "California", 15, 1984),
      c("Mac Dre", "Andre Hicks", 48, "California", 1.5, 1984),
      c("Missy Elliott", "Melissa Elliott", 47, "Virginia", 50, 1989),
      c("Notorious B.I.G.", "Christopher Wallace", 46, "New York", 160, 1992)
      )

adds_df = data.frame(adds, stringsAsFactors=FALSE)
names(adds_df) = names(rapper_df)
rapper_df = rbind(rapper_df, adds_df)

# rbind() on c() vectors coerced every field to character,
# so convert the numeric columns back
rapper_df$age = as.numeric(rapper_df$age)
rapper_df$net_worth = as.numeric(rapper_df$net_worth)
rapper_df$start_year = as.numeric(rapper_df$start_year)