Lab 12. Regressing toward mediocrity.

History

Francis Galton was the first to describe the regression line, in the late 1800s.
He was a proponent of eugenics and thought it would be better if people were taller. If you're interested, see Galton's paper. Where do you "stand" on this graph of mediocrity?

Review

We have seen regression before in this class.
You have (1) seen scatterplots, (2) run lm() and written a regression line based on the output, (3) interpreted the slope, intercept, and $R^2$, and (4) spotted outliers.

Vocabulary

Causal effect vs. association: an association does not imply causation. The regression model tells you only about the association between two variables.

We call these variables the explanatory and response variables, unfortunately. The eXplanatory variable is $x$ and the response is $y$.

Once we have a linear model based on our data, we can calculate predicted (fitted) values. Subtracting the predicted values from the observed values gives the residuals.
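The relationship between observed values, fitted values, and residuals can be checked by hand. The following is a minimal sketch, assuming the Boston data from MASS and the medv ~ nox model used later in this lab; the object names fit_sketch, predicted, and residual_by_hand are made up for illustration.

```r
library(MASS)  # provides the Boston data set

# Fit a simple linear model of median home value on nitrogen oxide concentration
fit_sketch <- lm(medv ~ nox, data = Boston)

# Predicted (fitted) values: one per observation
predicted <- fitted(fit_sketch)

# Residual = observed value - predicted value
residual_by_hand <- Boston$medv - predicted

# Compare against the residuals that R computes for us
all.equal(unname(residual_by_hand), unname(resid(fit_sketch)))
```

A positive residual means the observed value lies above the line; a negative residual means it lies below.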

Regression Test

We are testing hypotheses about the slope and the intercept.

Assumptions

These assumptions are different from the ones in the book. To check them, we look at graphs.

The relationship between $x$ and $y$ looks linear in the scatterplot.
The residuals are normally distributed.
The observations are independent.
The standard deviation of the response variable is the same for all values of $x$.
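
One common way to collect these four conditions into a single model statement is sketched below. The symbols $\beta_0$ (intercept), $\varepsilon_i$ (error for observation $i$), and $\sigma$ (common standard deviation) are notation assumed here for illustration, not taken from the lab.

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

The straight-line mean corresponds to linearity, the normal errors to normally distributed residuals, "iid" to independent observations, and the single $\sigma$ to equal spread at every value of $x$.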

The test itself

These are the hypotheses for the slope, $\beta_1$.

$H_0: \, \beta_1 = 0$
$H_1: \, \beta_1 \neq 0$

Why would we do this?
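
As a sketch of how such a test is usually carried out: the estimated slope is compared to its standard error. Here $b_1$ denotes the slope estimated from the data and $n$ the number of observations; both symbols are assumptions of this sketch rather than notation from the lab.

$$t = \frac{b_1 - 0}{SE(b_1)}, \qquad df = n - 2$$

If $H_0$ is true and the assumptions hold, $t$ follows a $t$-distribution with $n - 2$ degrees of freedom, so a large $|t|$ is evidence that the slope is not zero. A slope of zero would mean that $x$ tells us nothing, linearly, about $y$.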

library(MASS)
head(Boston)

Here's a scatterplot of the two variables we will model.

library(ggplot2)
# Plot the raw Boston data; the model has not been fitted yet
ggplot(Boston, aes(nox, medv)) +
    geom_point() +
    ggtitle("Scatterplot of Median House Price v. Air Pollution") +
    ylab("Median House Price") +
    xlab("Air Pollution (Nitrogen Oxide)")

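Let's fit a model with lm().

library(broom)  # provides tidy() and augment()
lm_Boston <- lm(medv ~ nox, data = Boston)
tidy(lm_Boston)  # coefficient table: estimate, std.error, statistic, p.value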

Interpret the slope.
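
The sketch below shows one way the slope, its standard error, the test statistic, and the p-value fit together, and how the slope might be rescaled for interpretation. It assumes the same medv ~ nox model; the names fit_sketch, coefs, slope, slope_se, t_stat, and p_value are made up for illustration. According to the MASS documentation, nox is measured in parts per 10 million and medv in thousands of dollars.

```r
library(MASS)  # provides the Boston data set

fit_sketch <- lm(medv ~ nox, data = Boston)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit_sketch)$coefficients

slope    <- coefs["nox", "Estimate"]
slope_se <- coefs["nox", "Std. Error"]

# t statistic for H_0: beta_1 = 0
t_stat <- slope / slope_se

# Two-sided p-value from a t-distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit_sketch), lower.tail = FALSE)

# Predicted change in medv (in $1000s) for a 0.1 unit increase in nox
slope * 0.1
```

Because nox only ranges over a fraction of a unit in this data, a 0.1 unit increase is easier to interpret than a full unit increase.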

library(broom)
library(ggplot2)
library(dplyr)
library(tidyr)

Attaching package: ‘dplyr’

The following object is masked from ‘package:MASS’:

    select

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

# augment() adds the fitted value, residual, and diagnostics for each observation
augment_Boston <- augment(lm_Boston)

head(augment_Boston)

Here's the fitted model plotted on top of the scatterplot.

ggplot(augment_Boston, aes(nox, medv)) +
    geom_segment(aes(xend = nox, yend = .fitted), color="darkgrey") +
    geom_point(col="midnightblue") +
    ggtitle("Look at the residuals") +
    geom_smooth(method = "lm", se = FALSE, col="midnightblue")

ggplot(augment_Boston, aes(nox, medv)) +
    geom_point(col="darkgrey") +
    ggtitle("Look at the line") +
    ylab("Median House Price") +
    xlab("Air Pollution (Nitrogen Oxide)") +
    geom_smooth(method = "lm", se = FALSE, col="midnightblue")

ggplot(augment_Boston, aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line(col="darkgrey")

ggplot(augment_Boston, aes(y = .resid, x = .fitted)) + 
  geom_point(col="midnightblue") +
  geom_hline(yintercept = 0) +
  labs(y = "Residuals", x = "Fitted values", title = "(c) Fitted vs. residuals")

reshape <- augment_Boston %>% dplyr::select(.resid, medv) %>%
  gather(key = "type", value = "value", medv, .resid)

ggplot(reshape, aes(y = value)) +
  geom_boxplot(aes(fill = type)) +
  ggtitle("Look at the variation in residuals and in the response") +
  scale_fill_manual(values = c("darkgrey", "gold"))
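
The comparison in the boxplots is closely related to $R^2$. As a rough sketch, assuming the same medv ~ nox model (the names fit_sketch and r2_by_hand are made up for illustration), $R^2$ can be recovered from the variance of the residuals relative to the variance of the response:

```r
library(MASS)  # provides the Boston data set

fit_sketch <- lm(medv ~ nox, data = Boston)

# Share of the response's variance that is left over in the residuals,
# subtracted from 1, gives the share explained by the model
r2_by_hand <- 1 - var(resid(fit_sketch)) / var(Boston$medv)

# Compare with the R^2 reported by summary()
all.equal(r2_by_hand, summary(fit_sketch)$r.squared)
```

The narrower the residual boxplot is compared with the response boxplot, the larger $R^2$ will be.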

References

About $R^2$

Reading for You

Regression Test

Output of head(Boston):

crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv
0.00632	18	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
0.02731	0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
0.03237	0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
0.06905	0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2
0.02985	0	2.18	0.458	6.430	58.7	6.0622	3	222	18.7	394.12	5.21	28.7

Output of tidy(lm_Boston):

term	estimate	std.error	statistic	p.value
(Intercept)	41.34587	1.811192	22.82800	9.866245e-80
nox	-33.91606	3.196337	-10.61091	7.065042e-24

Output of head(augment_Boston):

medv	nox	.fitted	.se.fit	.resid	.hat	.sigma	.cooksd	.std.resid
24.0	0.538	23.09904	0.3738461	0.9009631	0.002017389	8.331520	1.186674e-05	0.1083546
21.6	0.469	25.43924	0.4603695	-3.8392447	0.003059265	8.329853	3.274492e-04	-0.4619693
34.7	0.469	25.43924	0.4603695	9.2607553	0.003059265	8.321347	1.905220e-03	1.1143297
33.4	0.458	25.81232	0.4821178	7.5876787	0.003355137	8.324722	1.403528e-03	0.9131470
36.2	0.458	25.81232	0.4821178	10.3876787	0.003355137	8.318690	2.630512e-03	1.2501159
28.7	0.458	25.81232	0.4821178	2.8876787	0.003355137	8.330619	2.032830e-04	0.3475207