Final Project, Problem 1

The Student Dataset

The student dataset student.csv contains 8 measurements for 200 student subjects. Each student is recorded as male or female, attended a private or public school, received scores in reading, writing, math, and science, was or was not in an honors program, and has their family's socioeconomic status recorded.

Displayed below are a mosaic plot of the categorical data of students with high socioeconomic status and histograms of their corresponding scores. Attached in Appendix A are the figures for students with low and medium socioeconomic status (SES).

Across all SES levels, more females than males are in honors programs. As SES decreases, there is a clear drop in the number of students participating in honors programs. High and middle SES students show a similar public/private split, with far more public school enrollees than private school students. Low SES students are rarely enrolled in private schools, and those who are tend to be female. Compared with the distributions of the categorical variables (gender, school type, honors program status) for low and medium SES students, high SES students include more honors students overall. Unsurprisingly, the score histograms for high SES students are left-skewed, and these students outperform their medium and low SES peers in subject scores.

Fitting the Model

Our goal in fitting a model to student.csv is to predict socioeconomic status from a student's sex, school type, and subject scores. To obtain a model optimized for prediction, we used backward-forward (stepwise) model selection with AIC as the criterion.

The best-fitting model was SES ~ gender + schtyp + read + honors, with an AIC of 403.7529 (the lowest among the models visited by the backward-forward procedure) and the following estimated coefficients.

##        (Intercept) gendermale schtyppublic        read  honorsyes
## low       2.644419 -0.8675304    1.6367326 -0.07403496 -0.2937044
## middle    2.304598 -0.1910837   -0.0462631 -0.02470590 -1.1706580

Our multinomial logistic regression model, with high SES as the baseline category, has the form \(\log\frac{\pi_j(\vec{X})}{\pi_{high}(\vec{X})}=\alpha_j + \beta_{j1}X^{gender} + \beta_{j2}X^{schtyp} + \beta_{j3}X^{read} + \beta_{j4}X^{honors}\) for \(j \in \{low, middle\}\).
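Substituting the estimated coefficients from the table above (with indicators \(X^{gender}=1\) for male, \(X^{schtyp}=1\) for public school, and \(X^{honors}=1\) for honors) gives the two fitted baseline-category logits:

```latex
\log\frac{\hat{\pi}_{low}(\vec{X})}{\hat{\pi}_{high}(\vec{X})} = 2.6444 - 0.8675\,X^{gender} + 1.6367\,X^{schtyp} - 0.0740\,X^{read} - 0.2937\,X^{honors}

\log\frac{\hat{\pi}_{middle}(\vec{X})}{\hat{\pi}_{high}(\vec{X})} = 2.3046 - 0.1911\,X^{gender} - 0.0463\,X^{schtyp} - 0.0247\,X^{read} - 1.1707\,X^{honors}
```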

We can also interpret the following from the above model.

  • The relative probability of being in low SES is \(exp(-0.8675304)=0.4199875\) times that of being in high SES for a male student versus a female student, all other variables held constant.
  • The relative probability of being in middle SES is \(exp(-0.1910837)=0.8260634\) times that of being in high SES for a male student versus a female student, all other variables held constant.
  • The relative probability of being in low SES is \(exp(1.6367326)=5.138353\) times that of being in high SES for a public school student versus a private school student, all other variables held constant.
  • The relative probability of being in middle SES is \(exp(-0.0462631)=0.9547907\) times that of being in high SES for a public school student versus a private school student, all other variables held constant.
  • When a student's reading score increases by 1, the relative probability of being in low SES is \(exp(-0.07403496)=0.9286392\) times that of being in high SES, all other variables held constant.
  • When a student's reading score increases by 1, the relative probability of being in middle SES is \(exp(-0.02470590)=0.9755968\) times that of being in high SES, all other variables held constant.
  • The relative probability of being in low SES is \(exp(-0.2937044)=0.7454968\) times that of being in high SES for an honors student versus a non-honors student, all other variables held constant.
  • The relative probability of being in middle SES is \(exp(-1.1706580)=0.3101628\) times that of being in high SES for an honors student versus a non-honors student, all other variables held constant.
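These multipliers are simply the exponentiated coefficients. As a quick check in R (coefficient values copied from the summary table above):

```r
# Coefficients of the low-vs-high logit, copied from the fitted model summary
coefs_low <- c(gendermale = -0.8675304, schtyppublic = 1.6367326,
               read = -0.07403496, honorsyes = -0.2937044)

# Exponentiating gives the relative-probability multipliers quoted above
round(exp(coefs_low), 7)
```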

Diagnostics

We will perform diagnostics on our multinomial logistic model to identify outliers and other influential points. Then, we will assess the predictive power of our model.

Diagnostics for Low to High Relative Model

Standardized Residuals

If our model fits our data well, then the standardized residuals should be approximately standard normal.
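For reference, one common definition of the standardized Pearson residual for a binary-response fit (which we assume is what the diagnostic routine computes here) divides the Pearson residual by \(\sqrt{1-h_i}\), where \(h_i\) is the leverage:

```latex
r_i^{s} = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i\,(1 - \hat{\pi}_i)\,(1 - h_i)}}
```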

## Mean:  -0.03653888 
## Variance:  0.9581778

These residuals have a mean of -0.03653888 and a variance of 0.9581778, close to the standard normal values of 0 and 1. According to the histogram, the residuals are left-skewed, with a peak near -1. Overall, we believe the model fits the data reasonably well.

Dfbeta

Using Dfbeta with a threshold of 0.2, we found that the following observations were influential:

  • A female private school, non-honors student with a reading score of 36
  • A female private school, non-honors student with a reading score of 60
##    (Intercept) gendermale schtyppublic read honorsyes y     dBhat
## 1:           1          0            0   36         0 0 0.2388299
## 2:           1          0            0   60         0 1 0.7502532

Proportional Reduction in Squared Error

Our value of proportional reduction in squared error is \(0.2099446 \ll 1\). This suggests that adding our X variables improves prediction only slightly over the baseline \(\bar{y}\). Even though we used backward-forward model selection with AIC, our best model evidently has low predictive power.

## [1] 0.2099446
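The quantity above can be computed with a small helper function (a sketch; the name prop_red_sq_error is ours):

```r
# Proportional reduction in squared error:
# 1 - SSE(model) / SSE(baseline that predicts the mean ybar for everyone)
prop_red_sq_error <- function(y, fitted) {
  1 - sum((y - fitted)^2) / sum((y - mean(y))^2)
}

# A perfect fit gives 1; predicting the mean for everyone gives 0
prop_red_sq_error(c(0, 1, 1, 0), c(0, 1, 1, 0))  # 1
prop_red_sq_error(c(0, 1, 1, 0), rep(0.5, 4))    # 0
```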

Diagnostics for Middle to High Relative Model

Standardized Residuals

If our model fits our data well, then the standardized residuals should be approximately standard normal.

## Mean:  -0.02513983 
## Variance:  1.065217

The mean and variance of these residuals are -0.02513983 and 1.065217, respectively, which do not deviate much from what we would expect if the model fit well. According to the histogram, the residuals are bimodal. Overall, we believe the model fits the data reasonably well.

Dfbeta

Using Dfbeta with a threshold of 0.2, we found that the following observation was influential:

  • A female private school, non-honors student with a reading score of 36

##    (Intercept) gendermale schtyppublic read honorsyes y     dBhat
## 1:           1          0            0   36         0 0 0.2105382

Proportional Reduction in Squared Error

Our value of proportional reduction in squared error is \(0.1049546 \ll 1\). This suggests that adding our X variables improves prediction only slightly over the baseline \(\bar{y}\). Even though we used backward-forward model selection with AIC, our best model evidently has low predictive power.

## [1] 0.1049546

Prediction

We will predict the socioeconomic class for a female student who scored 60 on all subject tests, attended private school, and was not in an honors program.

##       high        low     middle 
## 0.29058933 0.04814572 0.66126495

According to the above output, this student is most likely from middle SES, since the predicted probability is highest for the middle SES column.

##         preds
## actual   high low middle
##   high     24   3     31
##   low       7  11     29
##   middle   15   5     75
## [1] 0.45

The error (confusion) matrix above summarizes all predictions. The error rate was 45%. This rate is high, probably because the score distributions do not differ greatly across SES levels.
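The 45% figure follows directly from the matrix above; a quick recomputation in R (values copied from the output):

```r
# Error matrix copied from the output above (rows = actual, cols = predicted)
em <- matrix(c(24, 7, 15,    # predicted high
               3, 11, 5,     # predicted low
               31, 29, 75),  # predicted middle
             nrow = 3,
             dimnames = list(actual = c("high", "low", "middle"),
                             preds  = c("high", "low", "middle")))

# Error rate = 1 - (correct predictions on the diagonal) / (total students)
error_rate <- 1 - sum(diag(em)) / sum(em)
error_rate  # 0.45
```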

Conclusion

Our model was poor at predicting SES level from gender, honors status, school type, and reading score. Its error rate was 45%, and it had particular difficulty with low SES students, who were more often predicted as middle SES than as low SES; in fact, high SES students were also most often predicted as middle SES. Only middle SES students were mostly predicted as their true class.

From the histograms, we can see a few patterns: writing scores are right-skewed at every SES level, and the distribution of math scores changes drastically as SES increases. Notably, math scores were not included in our best-fit model under backward-forward selection with AIC, despite showing a strong visual distinction in the histograms.

Appendix A: Supplementary Figures

These figures are for low socioeconomic status.

These figures are for middle socioeconomic status.

Appendix B: Problem 1 Code

knitr::opts_chunk$set(echo = TRUE, fig.align="center")
# LIBRARIES
library(foreign)
library(nnet)
library(ggplot2)
library(vcd)
library(grid)
library(LogisticDx)

# STUDENT DATASET
student = read.csv("student.csv", as.is=TRUE)
head(student)
names(student)
vplayout = function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)


# EXPLORING HIGH SOCIOECONOMIC CLASS
high_ses = student[student$ses=="high",]
high_ses_table = table(high_ses[,c("gender", "schtyp", "honors")])
mosaicplot(high_ses_table)
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
read_high = ggplot(high_ses, aes(x = read))
print(read_high + geom_histogram(bins=10) + ggtitle("Reading Scores (High SES)"), vp=vplayout(1, 1))
write_high = ggplot(high_ses, aes(x = write))
print(write_high + geom_histogram(bins=10) + ggtitle("Writing Scores (High SES)"), vp=vplayout(1, 2))
science_high = ggplot(high_ses, aes(x = science))
print(science_high + geom_histogram(bins=10) + ggtitle("Science Scores (High SES)"), vp=vplayout(2, 1))
math_high = ggplot(high_ses, aes(x = math))
print(math_high + geom_histogram(bins=10) + ggtitle("Math Scores (High SES)"), vp=vplayout(2, 2))
# OUR FULL MODEL
# BASELINE IS HIGH CLASS
full_model = multinom(factor(ses) ~ gender + schtyp + read + write + math + science + honors,
                      data = student, trace = FALSE)
reduced_model = multinom(factor(ses) ~ 1, data = student, trace = FALSE)

# BACKWARD-FORWARD WITH AIC TO CREATE PREDICTIVE MODEL
# Our predictive model includes gender, school type, read, honors.
bf_aic_fit = step(full_model, scope = list(lower = reduced_model, upper = full_model),
                  direction = "both", trace = FALSE) # step() uses AIC (k = 2) by default
# OUR BEST MODEL
best_model = multinom(factor(ses) ~ gender + schtyp + read + honors, data = student, trace = FALSE)
summary(best_model)$coefficients
# SPLIT DATA
test_data = student[,c("gender", "schtyp", "read", "write", "math", "science", "honors", "ses")]
split_data = split(test_data, test_data$ses)

# split() orders by factor level: [[1]] = high, [[2]] = low, [[3]] = middle
low_high = rbind(split_data[[2]], split_data[[1]])
med_high = rbind(split_data[[3]], split_data[[1]])
# THESE ARE THE MODELS WE WILL DO DIAGNOSTICS FOR
lh_model = glm(factor(ses) ~ gender + schtyp + read + honors, data = low_high, family = binomial(logit))
mh_model = glm(factor(ses) ~ gender + schtyp + read + honors, data = med_high, family = binomial(logit))
# LOW HIGH: OUTLIERS AND INFLUENTIAL POINTS
lh_influential_vals = dx(lh_model)
lh_pearsons = lh_influential_vals$Pr #Pearsons Residuals
lh_deviance = lh_influential_vals$dr #Deviance Residuals
lh_stdresid = lh_influential_vals$sPr #Standardized residuals (Pearson)
lh_dfbeta = lh_influential_vals$dBhat #DF Beta for removing each observation
lh_pearson_change = lh_influential_vals$dChisq #Change in pearson X^2 for each observation
lh_likelihood_change = lh_influential_vals$dDev #Change in LR-test G^2 for each observation
for_hist = data.frame(lh_stdresid)
hist_lh_stdresid = ggplot(data = for_hist, aes(x = lh_stdresid))
hist_lh_stdresid + geom_histogram(bins=10) + ggtitle("Standardized Residuals for Low to High")
cat("Mean: ", mean(lh_stdresid), "\nVariance: ", var(lh_stdresid))
cutoff_beta = 0.20
values = lh_dfbeta[lh_dfbeta > cutoff_beta] #Shows the values
lh_influential_vals[lh_dfbeta > cutoff_beta, c(1:6, 17)] #what observations they were
lh_prop_red = 1-sum((lh_model$y-lh_model$fitted.values)^2)/sum((lh_model$y-mean(lh_model$y))^2)
lh_prop_red
# MIDDLE HIGH: OUTLIERS AND INFLUENTIAL POINTS
mh_influential_vals = dx(mh_model)
mh_pearsons = mh_influential_vals$Pr #Pearsons Residuals
mh_deviance = mh_influential_vals$dr #Deviance Residuals
mh_stdresid = mh_influential_vals$sPr #Standardized residuals (Pearson)
mh_dfbeta = mh_influential_vals$dBhat #DF Beta for removing each observation
mh_pearson_change = mh_influential_vals$dChisq #Change in pearson X^2 for each observation
mh_likelihood_change = mh_influential_vals$dDev #Change in LR-test G^2 for each observation
for_hist = data.frame(mh_stdresid)
hist_mh_stdresid = ggplot(data = for_hist, aes(x = mh_stdresid))
hist_mh_stdresid + geom_histogram(bins=10) + ggtitle("Standardized Residuals for Middle to High")
cat("Mean: ", mean(mh_stdresid), "\nVariance: ", var(mh_stdresid))
cutoff_beta = 0.20
values = mh_dfbeta[mh_dfbeta > cutoff_beta] #Shows the values
mh_influential_vals[mh_dfbeta > cutoff_beta, c(1:6, 17)] #what observations they were
mh_prop_red = 1-sum((mh_model$y-mh_model$fitted.values)^2)/sum((mh_model$y-mean(mh_model$y))^2)
mh_prop_red
# PREDICTIONS
# Middle class
predict_values = data.frame(gender=c("female"), schtyp=c("private"), read=60, write=60, math=60, science=60, honors=c("no"))
predict(best_model, predict_values, type = "probs")
# ERROR MATRIX
fits = predict(best_model)
error_matrix = table(actual = test_data$ses, preds = fits)
error_matrix

# ERROR RATE
error_rate = 1 - sum(diag(error_matrix))/sum(error_matrix)
error_rate
# EXPLORING LOW SOCIOECONOMIC CLASS
low_ses = student[student$ses=="low",]
low_ses_table = table(low_ses[,c("gender", "schtyp", "honors")])
mosaicplot(low_ses_table)
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
read_low = ggplot(low_ses, aes(x = read))
print(read_low + geom_histogram(bins=10) + ggtitle("Reading Scores (Low SES)"), vp=vplayout(1, 1))
write_low = ggplot(low_ses, aes(x = write))
print(write_low + geom_histogram(bins=10) + ggtitle("Writing Scores (Low SES)"), vp=vplayout(1, 2))
science_low = ggplot(low_ses, aes(x = science))
print(science_low + geom_histogram(bins=10) + ggtitle("Science Scores (Low SES)"), vp=vplayout(2, 1))
math_low = ggplot(low_ses, aes(x = math))
print(math_low + geom_histogram(bins=10) + ggtitle("Math Scores (Low SES)"), vp=vplayout(2, 2))
# EXPLORING MIDDLE SOCIOECONOMIC CLASS
middle_ses = student[student$ses=="middle",]
middle_ses_table = table(middle_ses[,c("gender", "schtyp", "honors")])
mosaicplot(middle_ses_table)
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
read_middle = ggplot(middle_ses, aes(x = read))
print(read_middle + geom_histogram(bins=10) + ggtitle("Reading Scores (Middle SES)"), vp=vplayout(1, 1))
write_middle = ggplot(middle_ses, aes(x = write))
print(write_middle + geom_histogram(bins=10) + ggtitle("Writing Scores (Middle SES)"), vp=vplayout(1, 2))
science_middle = ggplot(middle_ses, aes(x = science))
print(science_middle + geom_histogram(bins=10) + ggtitle("Science Scores (Middle SES)"), vp=vplayout(2, 1))
math_middle = ggplot(middle_ses, aes(x = math))
print(math_middle + geom_histogram(bins=10) + ggtitle("Math Scores (Middle SES)"), vp=vplayout(2, 2))