Linear regression has made its mark in at least 5 quarter- or semester-long classes in my life. In those 5 courses, under the guidance of several professors at two separate institutions, regression was presented differently each time. For me, it’s another regression season, so I wanted to challenge myself and compile all I know about the famous and infamous regression line.
Pokemon, for you aliens and living-under-rock peoples, are creatures from a televised series of the same name. These creatures are either wild or under the coaching direction of Pokemon trainers who exercise their Pokemon in sportsmanly battle to increase their stats. A Pokemon has many stats, two of which are “attack” and “special attack”.
To start off, here’s our data on a scatterplot. I let the “attack” level of a pokemon be our independent variable. Independent variables can also be referred to as “predictors”, “covariates”, or “features”. I let the “special attack” stat be our dependent variable, which relies on the independent variable to determine its value.
In this scenario, we will be interested in estimating the regression function \(r(x)=E(Y|X=x)\) with our data. We are under the assumption that the relationship between attack, our \(X\), and special attack, our \(Y\), is linear.
pokemon <- read_csv("../../data/pokemon.csv")
names(pokemon) <- gsub(" ", "", names(pokemon))
pokemon_sub <- pokemon %>% filter(Type1 %in% c("Water", "Fire", "Grass"))
pokemon_sub <- pokemon_sub %>% filter(Generation!=6)
ggplot(pokemon_sub, aes(x=Attack, y=Sp.Atk)) +
geom_point(alpha=0.5) +
xlab("attack") +
ylab("special attack") +
ggtitle("Attack v. Special Attack in Water, Fire, and Grass Pokemon") +
theme_minimal()
To estimate our regression function, we use the least squares method, which minimizes the sum of squared vertical differences between the regression line and the observed values.
Our data are in the format \((X_1, Y_1), ..., (X_n, Y_n)\) where \(n=\) 216 is the total number of rows in the following dataframe. Here is a preview of the first 6 Pokemon in the dataset with their types and attack stats.
pokemon_sub %>% select(Name, Type1, Type2, Attack, Sp.Atk) %>% head
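If you want to double-check that \(n\), nrow() returns the number of rows directly.
nrow(pokemon_sub)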
Our goal is to predict \(Y\), a Pokemon’s special attack stat, based on its attack stat \(X\). Recall that our regression function was \(r(x)=E(Y|X=x)\). We can interpret this expectation as what we would expect of our variable \(Y\) given that we are only looking at a “vertical strip” at some given \(X=x\).
For example, if we were interested in \(E(Y|X=95)\), we would be concerning ourselves with the data lying only on the vertical segment below.
segment_values <- pokemon_sub %>% filter(Attack==95) %>% summarize(min=min(Sp.Atk), max=max(Sp.Atk))
ggplot(pokemon_sub, aes(x=Attack, y=Sp.Atk)) +
geom_segment(aes(y=segment_values$min, yend=segment_values$max, x=95, xend=95), col="red3", lwd=1.5, alpha=0.25) +
geom_point(alpha=0.5) +
xlab("attack") +
ylab("special attack") +
ggtitle("Attack v. Special Attack in Water, Fire, and Grass Pokemon") +
theme_minimal()
Based on our data, our plug-in estimator for \(E(Y|X=x)\) is the mean within that vertical strip. We calculate this to be \(\approx 93.67\).
pokemon_sub %>% filter(Attack==95) %>% summarize(sp_attack_mean=mean(Sp.Atk))
Our ideal linear regression function for special attack on attack draws a line through our data points such that our observed values of special attack \(Y_i\) are collectively as close to our predicted values \(\hat{Y}_i\) as possible. We will be predicting \(E(Y|X=x)\) with the plug-in estimate of the mean for that strip based on our data. Based on a proof from Adhikari, these stripwise means are the least squares predictors for \(Y|X\).
I have learned two ways of defining my regression line. What they share is this format
\[ Y=[\text{Slope}]X + [\text{Y-Intercept}]. \]
As we’ve learned in pre-algebra, we have a slope that describes how a unit increase in our \(x\) variable changes our \(y\). Our y-intercept, which sometimes lies outside of our prediction range of values, tells us what we would expect \(y\) to be if our \(x\) were 0.
Just last semester (Fall 2018), I had the privilege of taking probability theory with the well-known Professor Ani Adhikari, who has won many a teaching accolade, including 2019’s Daily Californian Best Professor.
When I was in Dr. Adhikari’s class, I was rather astounded at how little I knew about the basis of regression. I had taken an entire Regression Analysis class and mathematical statistics, but the narrative of the bivariate normal (BVN) distribution never came up. That being the case, I thought it would be inappropriate not to write a section on simple linear regression from a probabilistic perspective.
Let the random variable \(Z\) be distributed as standard normal. To specify the standard normal distribution, we provide mean \(\mu=0\) and variance \(\sigma^2=1\).
mu <- 0
sd <- 1
x <- seq(from = mu - 3*sd, to = mu + 3*sd, by = .01)
df <- data.frame(x=x, y=dnorm(x, mean=mu, sd=sd))
# * SAUER'S SHADING FUNCTION
shade_curve <- function(MyDF, zstart, zend, fill = "red", alpha = .5) {
geom_area(data = subset(MyDF, x >= mu + zstart*sd
& x < mu + zend*sd),
aes(y=y), fill = fill, color = NA, alpha = alpha)
}
# * PLOTTING
my_col <- "red3"
ggplot(df, aes(x=x, y=y)) + geom_line() +
shade_curve(MyDF=df, zstart=-1, zend=1, fill=my_col, alpha=0.3) +
shade_curve(MyDF=df, zstart=1, zend=2, fill=my_col, alpha=0.5) +
shade_curve(MyDF=df, zstart=-2, zend=-1, fill=my_col, alpha=0.5) +
shade_curve(MyDF=df, zstart=2, zend=6, fill=my_col, alpha=0.7) +
shade_curve(MyDF=df, zstart=-3, zend=-2, fill=my_col, alpha=0.7) +
scale_x_continuous(breaks=-3:3) +
scale_y_continuous(breaks=NULL) +
theme_minimal() +
ylab("") +
xlab("possible values of Z") +
ggtitle("Standard Normal Z")
An extension to the standard normal distribution is the bivariate normal (BVN) distribution. Two random variables \(X\) and \(Y\) that are distributed as bivariate normal achieve a “football shape” in Euclidean space \(\textbf{R}^2\). To fully determine a BVN distribution between two random variables, we provide a mean vector \([\mu_X, \mu_Y]\) and the covariance matrix between \(X\) and \(Y\).
Here’s a bivariate normal distribution with mean vector
\[ \mu = [0, 0] \]
and covariance matrix of
\[ \Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}. \]
# * EDITED FROM STEORTS' CODE AND STACK
# * https://stackoverflow.com/questions/36221596/plot-multivariate-gaussian-contours-with-ggplot2
# * https://stackoverflow.com/questions/29476925/change-colors-of-raster-plot-in-ggplot2
library(mvtnorm)   # needed for dmvnorm() below
range_x <- seq(-3, 3, length.out=100)
range_y <- range_x
mu <- c(0, 0)
rho <- 0.6
sigma <- matrix(c(1, rho, rho, 1), nrow=2)
# * CREATE X/Y VALUES AND PROBABILITIES
data.grid <- expand.grid(range_x, range_y)
q_samp <- cbind(data.grid,
p=dmvnorm(data.grid, mean=mu, sigma=sigma))
names(q_samp) <- c("range_x", "range_y", "probability")
# * PLOT
ggplot(q_samp, aes(x=range_x, y=range_y, z=probability)) +
geom_raster(aes(fill=probability)) +
geom_contour(col="white", alpha=0.8) +
coord_fixed(xlim = c(-3, 3), ylim = c(-3, 3), ratio = 1) +
xlab("possible values of random variable X") +
ylab("possible values of random variable Y") +
theme_minimal() +
scale_fill_gradientn(colours=c("white","red2","red3"))
Flashing back to the vertical strips we were talking about earlier: we know that \(E(Y|X)\) is the least squares predictor of \(Y\) among all functions of our covariate \(X\). Because we are under the assumption that the relationship between our variables is linear, we will now restrict ourselves to functions of \(X\) of the form \(f(X)=mX+b\).
Our task now is to minimize the mean squared error (MSE) over values of our slope \(m\) and y-intercept \(b\). The formula for MSE is
\[ MSE(m,b) = E[(Y-(mX+b))^2]. \]
We do this with calculus and rules of expectation. These steps are provided by Adhikari.
We begin by letting \(m\) be fixed so we can optimize over \(b\).
\[ \begin{aligned} MSE(m,b) &= E[(Y-(mX+b))^2] \\ &= E[((Y-mX)-b)^2] \\ &= E[(Y-mX)^2] - 2bE[Y-mX] + b^2 \end{aligned} \]
We’ll take the derivative with respect to \(b\).
\[ \frac{\partial}{\partial b} MSE(m,b) = -2E[Y-mX] + 2b \]
We set this equal to 0. Finally,
\[ 0 = -2E[Y-mX] + 2b \implies b^* = E[Y-mX] = E[Y] - mE[X]. \]
To be thorough, we have to check that the second derivative is positive to confirm we have a minimum.
\[ \frac{\partial^2}{\partial b^2} MSE(m,b) = 2 \]
Our resulting sign is positive which implies we have a minimum.
Now, we will solve for \(m\), replacing \(b\) with our \(b^*\).
\[ \begin{aligned} MSE(m,b^*) &= E[(Y-(mX+b^*))^2] \\ &= E[(Y-(mX+E[Y]-mE[X]))^2] \\ &= E[[(Y-E[Y])-m(X-E[X])]^2] \end{aligned} \]
We have now changed our MSE formula into an equation that only contains \(m\). We’ll continue simplifying below.
\[ \begin{aligned} MSE(m,b^*) &= E[[(Y-E[Y])-m(X-E[X])]^2] \\ &= E[(Y-E[Y])^2] - 2mE[(X-E[X])(Y-E[Y])] + m^2E[(X-E[X])^2] \\ &= \text{Var}(Y) - 2m\text{Cov}(X,Y) + m^2\text{Var}(X) \end{aligned} \]
After all that calculation, look at that beautiful result! We can now take the derivative with respect to \(m\) to retrieve our \(m^*\).
\[ \frac{\partial}{\partial m} MSE(m,b^*) = -2\text{Cov}(X,Y) + 2m\text{Var}(X) \]
When we set the above to 0, we have this result.
\[ 0 = -2\text{Cov}(X,Y) + 2m\text{Var}(X) \implies m^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} \]
Again, we check the second derivative.
\[ \frac{\partial^2}{\partial m^2} MSE(m,b^*) = 2\text{Var}(X) \]
This value is positive so long as \(\text{Var}(X)>0\) (that is, \(X\) is not constant), so we are sure we have a minimum.
The values of \(b^*\) and \(m^*\) are our least squares coefficients.
\[ \begin{align} m^* &= \frac{\text{Cov}(X,Y)}{\text{Var}(X)} \\ b^* &= E[Y] - m^*E[X] \end{align} \]
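As a quick numeric sanity check of this closed form (a sketch with arbitrary illustrative values for the means, standard deviations, and \(\rho\)), a generic optimizer over \((m, b)\) lands on the same slope and intercept.
# population MSE(m, b) written in terms of the moments of X and Y:
# MSE(m, b) = Var(Y) - 2m*Cov(X,Y) + m^2*Var(X) + (E[Y] - m*E[X] - b)^2
mu_x <- 2; mu_y <- 5; sd_x <- 1.5; sd_y <- 3; rho <- 0.6  # illustrative values, not from the Pokemon data
mse <- function(par) {
  m <- par[1]; b <- par[2]
  sd_y^2 - 2*m*rho*sd_x*sd_y + m^2*sd_x^2 + (mu_y - m*mu_x - b)^2
}
opt <- optim(c(0, 0), mse)  # numerically minimize over (m, b)
closed_form <- c(m_star = rho*sd_y/sd_x, b_star = mu_y - rho*sd_y/sd_x*mu_x)
list(numeric = opt$par, closed_form = closed_form)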
The best linear predictor of \(E(Y|X)\), regardless of the true relationship between \(X\) and \(Y\), is the least squares line defined by
\[ \begin{align} Y &= m^*X+b^* \\ &= \frac{\text{Cov}(X,Y)}{\text{Var}(X)}X + E[Y] - m^*E[X] \\ &= \rho \frac{\sigma_Y}{\sigma_X}X + \big(\mu_Y - \rho \frac{\sigma_Y}{\sigma_X} \mu_X\big) \end{align} \]
When \(X\) and \(Y\) have a joint distribution that is bivariate normal, then the best predictor of \(E(Y|X)\) is truly linear. That best line is \(Y=m^*X+b^*\). Even when \(X\) and \(Y\) are not jointly bivariate normal, this line is still the best linear predictor for \(E(Y|X)\), although it may not always be appropriate. (See the section on common errors.)
Also to be well noted, the variance of \(Y|X\) is
\[ \text{Var}(Y|X) = (1-\rho^2)\sigma_Y^2. \]
Notice that this variance depends only on \(\rho\) and the variance of \(Y\), not on the value of \(x\): it is constant throughout every vertical strip in the bivariate normal setting.
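Here is a small simulation sketch of that fact, reusing the BVN from the contour plot above (the point \(x=1\) and the strip half-width of 0.05 are arbitrary choices): the spread of \(Y\) within a narrow strip should be close to \((1-\rho^2)\sigma_Y^2 = 1-0.6^2 = 0.64\).
library(mvtnorm)
set.seed(1)
draws <- rmvnorm(100000, mean = c(0, 0), sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
strip <- draws[abs(draws[, 1] - 1) < 0.05, ]  # keep only draws whose X lands near 1
c(strip_variance = var(strip[, 2]), theoretical = (1 - 0.6^2) * 1)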
We now have information about how to predict \(E(Y|X)\) based on the bivariate normal setting. We also know that even if our \(X\) and \(Y\) are not distributed as bivariate normal our \(Y=m^*X+b^*\) still does fine work as the best linear estimator of \(E(Y|X)\). How does this all work on a dataset?
I ask my students this question that is related but not completely on-topic: “What do we do whenever we see hypotheses?” Their answer? “Check assumptions!” It was definitely unclear to me in the past what these assumptions were based upon, but now that we have seen the bivariate normal case, it makes sense why we need to assume certain things when applying theory to build models for our data.
We’ll see these assumptions soon.
I am now incorporating information from my mathematical statistics class that I took with Professors Debashis Paul, Jane-Ling Wang, Hans-Georg Mueller, and Prabir Burman. I could easily spend hours talking about the incredible education and experience I had with UC Davis Stats.
In the previous section, we spent a whole lot of time building up theory on how to minimize MSE which resulted in a least squares prediction model of \(Y=m^*X+b^*\). This section builds completely off intuition from the previous one.
Let’s start off with a straight line: \(y=mx+b\). (I never outgrew the slope-intercept form from 7th grade.) Let’s replace our \(m\) and \(b\) with Greek letters to form \(y=\beta_0+\beta_1x\). This is still the formula for a straight line, but the \(\beta_i\) notation is the standard way to present coefficients for linear regression. The following proof is provided by DeGroot and Schervish (2012).
The sum of squared vertical distances from \(y_i\) to \(\hat{y}_i\) for \(n\) observations is written as
\[ Q = \sum_{i=1}^{n} [y_i-(\beta_0+\beta_1x_i)]^2. \]
Our goal is to minimize \(Q\), so we have to take partial derivatives with respect to the betas.
\[ \begin{align} \frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} [y_i-(\beta_0+\beta_1x_i)] \\ \frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i[y_i-(\beta_0+\beta_1x_i)] \end{align} \]
We set these two derivatives to 0 and simplify.
\[ \begin{align} 0 &= -2 \sum_{i=1}^{n} [y_i-(\beta_0+\beta_1x_i)] \\ &= \sum_{i=1}^{n} [y_i-(\beta_0+\beta_1x_i)] \\ &= \sum_{i=1}^{n}y_i - \sum_{i=1}^{n}\beta_0 - \sum_{i=1}^{n} \beta_1x_i \\ &= \sum_{i=1}^{n}y_i - n\beta_0 - \beta_1\sum_{i=1}^{n} x_i \end{align} \]
\[ \begin{align} 0 &= -2 \sum_{i=1}^{n} x_i[y_i-(\beta_0+\beta_1x_i)] \\ &= \sum_{i=1}^{n} x_i[y_i-(\beta_0+\beta_1x_i)] \\ &= \sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n} x_i\beta_0 - \sum_{i=1}^{n} \beta_1x_i^2 \\ &= \sum_{i=1}^{n}x_iy_i - \beta_0\sum_{i=1}^{n} x_i - \beta_1\sum_{i=1}^{n} x_i^2 \end{align} \]
We use the above results to produce what are called the normal equations for \(\beta_0\) and \(\beta_1\).
\[ \begin{align} \beta_0n + \beta_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i \\ \beta_0\sum_{i=1}^{n}x_i + \beta_1 \sum_{i=1}^{n}x_i^2 &= \sum_{i=1}^{n} x_iy_i \end{align} \]
From these equations, we are able to find the least squares coefficients for \(\beta_0\) and \(\beta_1\) that minimize MSE. The coefficients are estimated with
\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^{n} (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y}-\hat{\beta}_1\bar{x} \end{align} \]
which clearly show similarity to the coefficients we derived earlier
\[ \begin{align} m^* &= \frac{\text{Cov}(X,Y)}{\text{Var}(X)} \\ b^* &= E[Y] - m^*E[X]. \end{align} \]
Notice we have used plug-in estimators to estimate the values for \(m^*\) and \(b^*\).
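To make the plug-in idea concrete, here is a quick sketch using the Pokemon data from earlier: the sample covariance over the sample variance gives \(\hat{\beta}_1\), and the sample means give \(\hat{\beta}_0\), reproducing the coefficients that lm() reports in the next section.
x <- pokemon_sub$Attack
y <- pokemon_sub$Sp.Atk
slope_hat <- cov(x, y) / var(x)                 # plug-in estimate of m*
intercept_hat <- mean(y) - slope_hat * mean(x)  # plug-in estimate of b*
c(slope = slope_hat, intercept = intercept_hat)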
Previously, we had that for our \(Y=m^*X+b^*\) to be the best among all possible prediction models, \(X\) and \(Y\) had to be bivariate normal. In order to apply this to empirical data, we are going to make the following simplifying assumptions.
Our regression model is
\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]
where the errors \(\epsilon_i\) are independent and distributed \(N(0,\sigma^2)\), with variance constant throughout the model. We have found our fitted simple linear regression line to be
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i \]
given the coefficients above. Our unbiased estimate of \(\sigma^2\) is
\[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{\epsilon}_i^2 \]
where \(\hat{\epsilon}_i = y_i - (\hat{\beta_0}+\hat{\beta_1}x_i)\).
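As a sketch of what this looks like in practice, the square root of this estimate is the “residual standard error” that summary() reports; the lm() call below is the same fit as the pokemon_model used later in the post.
fit <- lm(Sp.Atk ~ Attack, data = pokemon_sub)                 # same model fit as pokemon_model below
sigma2_hat <- sum(residuals(fit)^2) / (nrow(pokemon_sub) - 2)  # unbiased estimate of sigma^2
c(sigma2_hat = sigma2_hat, sigma_hat = sqrt(sigma2_hat))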
Under the assumption \(Y_i|X_i \sim N(\mu_i, \sigma^2)\), the least squares estimator of \((\beta_0, \beta_1)\) is also the maximum likelihood estimator.
The expectation of each estimator is the true parameter, i.e. \(E(\hat{\beta}_i|\textbf{X})=\beta_i \, \forall \, i\).
Given \(s^2_X=\frac{1}{n} \sum_{i=1}^{n}(X_i-\bar{X}_n)^2\), we have
\[ \text{Var}(\hat{\beta}_i|\textbf{X}) = \frac{\sigma^2}{ns^2_X} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{bmatrix}. \]
The estimators are consistent, i.e. \(\hat{\beta}_i \overset{P}\to \beta_i \, \forall \, i\).
They are also asymptotically normal, i.e. \(\frac{\hat{\beta}_i - \beta_i}{\hat{SE}(\hat{\beta}_i)} \overset{D}\to N(0,1) \, \forall \, i\).
With the above information, we can conduct inference on hypotheses like \(H_0: \beta_1=0\) making use of our sample data and a t-test with \(n-2\) degrees of freedom. Of course, if we had the true (non-estimated) standard errors, then we could use a Wald test.
It was probably really unfun of me to not mention Pokemon in the previous section. Well, it’s time to bring back that fun. Here’s a preview of that data again, since scrolling back would be a pain.
pokemon_sub %>% select(Name, Type1, Type2, Attack, Sp.Atk) %>% head
On the first day of one of my classes this semester, current professor of mine Professor Mark van der Laan etched into my head that the first thing I should be able to do as a statistician is understand my data generating process (DGP).
The data I have selected are “Pokedex data” on Pokemon. These Pokemon stats are (1) manmade but also (2) have no measurement error (unless someone fudged with the data before I got it!). I remain unaware of how The Pokemon Company, its engineers, and its designers have decided how attack and special attack should vary with one another. From what I understand, it is a well-developed system.
Pokemon have different “types”, which make their moves in battle against any given opponent more or less effective at lowering hit points. To be exact, in this dataset there are 18 different Pokemon types, which are as follows.
pokemon %>% pull(Type1) %>% unique
[1] "Grass" "Fire" "Water" "Bug"
[5] "Normal" "Poison" "Electric" "Ground"
[9] "Fairy" "Fighting" "Psychic" "Rock"
[13] "Ghost" "Ice" "Dragon" "Dark"
[17] "Steel" "Flying"
For my regression model, I have subset my data to only include the three primary Pokemon types: grass, water, and fire. I had no reason for doing this aside from limiting the number of observations in my dataset. Therefore, the line that we are fitting will only be generalizable to Pokemon that have these types.
Furthermore, I have subset my data to only include pokemon from generations 1 through 5. I did this in order to leave some pokemon out for prediction. I have reason to believe that the DGP that yielded attack and special attack stats for pokemon from generations 1 through 5 will also generalize to generation 6. (There’s no way that the creators would just suddenly create insanely different pokemon when eventually we are able to transfer pokemon from one generational game to another.)
Let me remind you of our scatterplot from earlier.
ggplot(pokemon_sub, aes(x=Attack, y=Sp.Atk, col=Type1)) +
geom_point(alpha=0.5) +
xlab("attack") +
ylab("special attack") +
ggtitle("Attack v. Special Attack in Water, Fire, and Grass Pokemon") +
theme_minimal()
This time, I’ve colored in our observations based on which major type a Pokemon is in my dataset. We see that there is a pretty nice scatter, which leads me to believe that I can model these pokemon types together.
We should also be interested in looking at the correlation \(r\) of these data.
cor(pokemon_sub$Attack, pokemon_sub$Sp.Atk)
[1] 0.5590745
Since \(r \approx 0.559\) and correlation is a number anywhere between -1 and 1, we can conclude that there is a positive association between attack and special attack. However, this is not a very high correlation, so we will proceed with caution.
While all of the calculations above made it seem like linear regression is super laborious to calculate, R makes it very convenient. All you have to do is use lm(Y ~ X, data=some_dataframe), then use summary() to examine the model fit.
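For our Pokemon data, that looks like the following (this is the same call echoed in the Call line of the output below).
pokemon_model <- lm(Sp.Atk ~ Attack, data=pokemon_sub)
summary(pokemon_model)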
Call:
lm(formula = Sp.Atk ~ Attack, data = pokemon_sub)
Residuals:
Min 1Q Median 3Q Max
-85.285 -13.565 -0.783 12.709 88.442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.17785 4.79478 7.128 1.54e-11 ***
Attack 0.57934 0.05873 9.864 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.52 on 214 degrees of freedom
Multiple R-squared: 0.3126, Adjusted R-squared: 0.3094
F-statistic: 97.3 on 1 and 214 DF, p-value: < 2.2e-16
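The t value and the tiny p-value in the Attack row are exactly the t-test of \(H_0: \beta_1=0\) with \(n-2=214\) degrees of freedom from the theory section. Here is a quick by-hand check (a sketch that assumes the pokemon_model object fit above).
coefs <- summary(pokemon_model)$coefficients
t_stat <- coefs["Attack", "Estimate"] / coefs["Attack", "Std. Error"]  # estimate divided by its standard error
p_val <- 2 * pt(-abs(t_stat), df = nrow(pokemon_sub) - 2)              # two-sided p-value with n - 2 df
c(t_stat = t_stat, p_value = p_val)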
The line of best fit ended up being \(y=0.579x+34.178\). We can plop this right on top of our data using plot().
plot(x=pokemon_sub$Attack, y=pokemon_sub$Sp.Atk,
pch=19,
col=alpha("black", 0.4),
xlab="attack",
ylab="special attack",
main="Attack v. Special Attack in Water, Fire, and Grass Pokemon")
abline(reg=pokemon_model, lwd=4, col=alpha("red3", 0.7))
This line doesn’t seem to follow the pattern I expected it would. If you look at the 1:1 line below, you can see the data cluster around it.
plot(x=pokemon_sub$Attack, y=pokemon_sub$Sp.Atk,
pch=19,
col=alpha("black", 0.4),
xlab="attack",
ylab="special attack",
main="Attack v. Special Attack in Water, Fire, and Grass Pokemon")
abline(a=0, b=1, lwd=4, col=alpha("red3", 0.7))
Though it wasn’t what I expected, the fit is unsurprising: there are a number of influential data points below the identity line that pull the line of best fit lower and toward them. Earlier, we mentioned that we are working under the assumption that the variance of \(Y\) is constant at each \(X=x\), but this assumption is quite obviously not met here. For observations with higher attack stats, the spread of special attack is much wider than for observations with lower attack stats.
Provided our assumptions hold, we can predict the special attack stat for Pokemon with attack stats as small as 5 and as large as 190. It is important that we do not extrapolate, which means predicting outside of the range of attack values we used to fit our line.
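If you want to check the exact endpoints of that range yourself, they come straight from the data we fit the line on.
range(pokemon_sub$Attack)  # smallest and largest attack values used to fit the line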
dplyr and ggplot2
When implementing regression using tidyverse functions, the first line, pokemon_model <- lm(Sp.Atk ~ Attack, data=pokemon_sub), would remain the same. Instead of using summary(), we will use functions from library(broom).
Using tidy(), we can see coefficient information. This is the same information from summary(), but it’s arguably more visually appealing.
tidy(pokemon_model)
And using glance(), we can see information about model fit. Earlier, we calculated the correlation between \(X\) and \(Y\) to be \(\approx 0.559\). The following value of \(R^2\) is \(\approx 0.313\) and can be interpreted as the proportion of variation in \(Y\) that can be explained by \(X\).
glance(pokemon_model)
Notably, the value of correlation squared is the \(R^2\) value.
correlation <- cor(pokemon_sub$Attack, pokemon_sub$Sp.Atk)
c(correlation=correlation, correlation_squared=correlation^2)
correlation correlation_squared
0.5590745 0.3125643
Instead of using plot(), we can use ggplot() to display our line of best fit on our scatterplot. In the following, I could have used either geom_abline() or geom_smooth(method="lm"), but I used the latter for no particular reason.
ggplot(pokemon_sub, aes(x=Attack, y=Sp.Atk)) +
geom_point(alpha=0.5) +
geom_smooth(method="lm", se=FALSE, col="red3", lwd=1.5, alpha=0.7) +
xlab("attack") +
ylab("special attack") +
ggtitle("Attack v. Special Attack in Water, Fire, and Grass Pokemon") +
theme_minimal()
As I mentioned earlier, I subset my data to only include fire, grass, and water pokemon from the first five generations of Pokemon. I will now use my line of best fit to predict special attack stats for generation 6 pokemon.
gen_6 <- pokemon %>% filter(Type1 %in% c("Water", "Fire", "Grass"))
gen_6 <- gen_6 %>% filter(Generation==6)
gen_6 %>% select(Name, Type1, Type2, Attack, Sp.Atk) %>% head
Earlier, I mentioned that we could only make predictions for pokemon with attack stats between 5 and 190. We can check this.
c(gen6_min_attack=min(gen_6$Attack), gen6_max_attack=max(gen_6$Attack))
gen6_min_attack gen6_max_attack
45 130
Because the generation 6 attack stats fall within this range, we can continue on to prediction.
In order to make predictions, you use the predict() function. I used predict(pokemon_model, newdata=gen_6 %>% select(Attack)), which plugs in our \(x\) values for us and does the plug-and-chug math. For Chespin, our prediction is \(y=0.57934(61)+34.17785 \approx 69.5\).
gen_6_predictions <- predict(pokemon_model, newdata=gen_6 %>% select(Attack))
Below, I have presented the errors \(y_i-\hat{y}_i\).
gen_6_predictions <- cbind(gen_6 %>% select(Name, Type1, Type2, Attack, Sp.Atk), Sp.Atk.Pred=gen_6_predictions)
gen_6_predictions <- gen_6_predictions %>% mutate(Error=Sp.Atk-Sp.Atk.Pred)
gen_6_predictions
On average, our prediction of special attack was off by 22.16 points. That’s pretty off. We can see why by looking at the following plot.
ggplot(gen_6_predictions, aes(x=Attack, y=Sp.Atk)) +
geom_point(alpha=0.5, aes(col=Type1)) +
geom_abline(slope=coef(pokemon_model)[["Attack"]], intercept=coef(pokemon_model)[["(Intercept)"]], col="red3", lwd=1.5, alpha=0.7) +
xlab("attack") +
ylab("special attack") +
ggtitle("Attack v. Special Attack in Water, Fire, and Grass Pokemon") +
theme_minimal()
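(As an aside, if you want to reproduce the 22.16 figure quoted above, it is presumably the mean absolute error of the predictions, computed from the Error column.)
mean(abs(gen_6_predictions$Error))  # average size of the prediction errors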
In the end, I’m not surprised our predictions weren’t accurate. Our \(R^2\) value was only about 0.31. This example also focused on fire, water, and grass pokemon for no particular reason. Here are scatterplots of attack versus special attack for all of the provided pokemon types.
ggplot(data=pokemon, aes(x=Attack, y=Sp.Atk, col=Type1)) +
geom_point(alpha=0.5) +
facet_wrap(~Type1) +
theme_minimal() +
xlab("attack") +
ylab("special attack") +
theme(legend.position="none") +
ggtitle("Attack and Special Attack Between Pokemon Types")
The relationship between attack and special attack is seemingly unique to each of the pokemon types. In the future, I really ought to do a thorough data exploration before committing to a linear model.
I will close my analysis with a quote from the professor of my undergrad capstone course, the ever incredible Professor Duncan Temple Lang.
“Immerse in your data so that you should not be surprised by your results.”
-Duncan Temple Lang
Linear regression is extremely popular, but it is almost as popularly misused. Did you notice how difficult it was to derive the coefficients and the theory, yet how easy software makes the implementation? That gap becomes a terrible problem.
Many courses and many disciplines teach regression as if assumptions don’t matter. But they do! In order to interpret a linear model, and to have any confidence (not in the statistical sense) that your regression model will do any good, you need to assess your assumptions. This requires making plots and some deep thought about whether the assumptions are even practical.
Correlation between variables has also fed the belief that correlation implies causation. While a causal relationship will often show up on a graph as correlated variables, correlation alone is not enough to say that a covariate causes an outcome.
Extrapolation, using your fitted line to predict for values outside of the range used to fit it, is very bad! And I’ve seen many students do this, so I don’t think it’s nuts of me to believe that they’ve learned it from someone, or that there are others who were never taught this but still do it.
Lastly, generalization. I spoke about the data generating process that gave rise to my data, and I can only predict special attack for other observations that were sampled from that same process. To be clear, that means I cannot use Yu-Gi-Oh or Digimon stats to predict special attack! Pokemon only!
I’m pretty glad I did this synthesis. Hope you liked it. This was certainly a product of all my educators who gave me their best efforts.
If you have questions, concerns, or corrections, do let me know. I want to be on the right side of the regression schism.
Adhikari, A., & Pitman, J. (20xx). “Chapter 22: Prediction”. Probability for Data Science. Creative Commons. [URL]
Adhikari, A., & Pitman, J. (20xx). “Chapter 24: Simple Linear Regression”. Probability for Data Science. Creative Commons. [URL]
DeGroot, M. H., & Schervish, M. J. (2012). “Chapter 11: Linear Statistical Models”. Probability and Statistics. Pearson Education.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). “Chapter 2: Inferences in Regression and Correlation Analysis”. Applied Linear Statistical Models (Vol. 4, p. 318). Chicago: Irwin.
Sauer, S. (2016). Shading multiple areas under normal curve. [URL]
Steorts, R. (2015). Visualizing the Multivariate Normal, Lecture 9. [URL]
Wasserman, L. (2013). “Chapter 13: Linear and Logistic Regression”. All of Statistics. Springer Science & Business Media.
I downloaded the Pokedex data from data.world but the original source is Kaggle.