22. San Francisco, CA.
I've only been programming for a year, but it already feels like a lifetime. Palautatan means "statistics" in Tagalog.

View Home Page | View My GitHub Profile
Final Project, Problem 2

The Baby Dataset

The baby dataset baby.csv contains 7 measurements for 189 women who gave birth. The measurements taken were: age, weight, whether they were smokers, whether they had pre-mature labor, whether they had history of hypertension, the amount of visits they had during their first trimester, and whether or not their child was born with low birth weight.

Displayed below are a mosaic plot of the categorical data of the mothers in our dataset. This mosaic plot shows a mainly uniform distribution of the categorical variables. However, we do see that if a mother is a nonsmoker and has had pre-mature labor, this would imply they probably had a history of hypertension.

The average age of a mother in this dataset is 23.2381 years.

## [1] 23.2381

The average weight of a mother in this dataset is 129.8148 pounds.

## [1] 129.8148

Most of these women did not visit the hospital during their first trimester. If they did visit the hospital, they were most likely to have just visited once.

## 
##   0   1   2   3   4   6 
## 100  47  30   7   4   1

Fitting the Model

Our goal in creating a model to baby.csv is to predict whether a baby is born with a low birth weight based on the variables named above. In order to fit a model optimal for prediction to baby, we used Backward-Forward Model Selection using AIC as our criterion.

The best fit model was low ~ age + weight + smoke + pre + hyp with the following estimated coefficients with and AIC of 214.2. This model had the lowest AIC as a result of the Backward-Forward Model Selection procedure.

##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -2.03196863 1.111157032 -1.828696 0.0674451342
## age          0.06031852 0.036316943  1.660892 0.0967351619
## weight       0.01615440 0.006923008  2.333436 0.0196252683
## smokeyes    -0.51836638 0.348309029 -1.488237 0.1366884063
## preyes      -1.79403783 0.508841234 -3.525732 0.0004223141
## hypyes      -1.78270977 0.716698319 -2.487392 0.0128683464

When a specific woman’s age increases by 1, the odds of the child being born with low birth weight is multiplied by \(exp(0.06032)=1.062176\), all other variables held constant.

When a specific woman’s weight increases by 1, the odds of the child being born with low birth weight is multiplied by \(exp(0.01615)=1.016281\), all other variables held constant.

The odds of the child being born with low birth weight when the mother smokes is \(exp(-0.51837)=0.5954904\) times when the mother does not smoke, all other variables held constant.

The odds of the child being born with low birth weight when the mother has pre-mature labor is \(exp(-1.79404)=0.166287\) times when the mother does not have pre-mature labor, all other variables held constant.

The odds of the child being born with low birth weight when the mother has a history of hypertension is \(exp(-1.78271)=0.1681818\) times when the mother does not have a hypertension history, all other variables held constant.

Diagnostics

Standardized Residuals If our model fits our data well, then the residuals should be distributed as standard normal.

## Mean:  -0.02750199 
## Variance:  1.090097

The mean and variance of these residuals are -0.02750199 and 1.090097 respectively which are not significantly deviant from what we would expect if the model fit our data well. According to the histogram, the residuals have a bimodal distribution. (One peak is more dense than the other.) Despite this, we believe that our model fits our data relatively well.

Dfbeta Using Dfbeta with a threshold of 0.2, we were able to find that the following observations were influential:

-19 year old smoker without pre-mature labor and with history of hypertension weighing 184 pounds -24 year old smoker with pre-mature labor and without history of hypertension weighing 90 pounds -23 year old smoker without pre-mature labor and without history of hypertension weighing 187 pounds -26 year old smoker without pre-mature labor and without history of hypertension weighing 190 pounds -22 year old nonsmoker without pre-mature labor and with history of hypertension weighing 120 pounds -21 year old nonsmoker without pre-mature labor and without history of hypertension weighing 200 pounds -34 year old smoker without pre-mature labor and with history of hypertension weighing 187 pounds -22 year old nonsmoker without pre-mature labor and with history of hypertension weighing 95 pounds

##    (Intercept) age weight smokeyes preyes hypyes     dBhat
## 1:           1  19    184        1      0      1 0.2022474
## 2:           1  24     90        1      1      0 0.2038941
## 3:           1  23    187        1      0      0 0.2295923
## 4:           1  26    190        1      0      0 0.2631358
## 5:           1  22    120        0      0      1 0.2672501
## 6:           1  21    200        0      0      0 0.3197838
## 7:           1  34    187        1      0      1 0.4075552
## 8:           1  22     95        0      0      1 0.4380299

Proportional Reduction in Squared Error Our value of Proportional Reduction in Squared Error is \(0.171218<<1\). This suggests that when we add our X variables to our model, the predictive power does not significantly change from the baseline \(\bar{y}\). Even though we used Backward-Forward model selection with AIC, it is apparent that our best model has low predictive power.

## [1] 0.171218

Prediction

A woman aged 29 weighing 157 pounds who does not smoke, did not have pre-mature labor, has no history of hypertension, and had 10 visits is predicted to have a baby with low birthweight.

##         1 
## 0.9049528

For all predictions, we have created an error matrix. Our model seems to predict low birth weight more accurately than not low birth weight.

##       preds
## actual   0   1
##      0  21  38
##      1  10 120

The error rate was 25.39%.

## [1] 0.2539683

Conclusion

When the best model was found, the only variable that was dropped was number of visits during the first trimester. As both age and weight increase, the probability of low birth weight also increases, whereas when the mother smokes, has a history of pre-mature labor, or has a history of hypertension, the probability of low birth weight decreases. We are deeply disturbed by these results. We also have found eight influential points and our model’s predictive power is very low. This is further validated by the atrociously high error rate found in the last section.

Appendix C: Problem 2 Code

knitr::opts_chunk$set(echo = TRUE, fig.align="center")
# LIBRARIES
library(foreign)
library(nnet)
library(ggplot2)
library(vcd)
library(grid)
library(LogisticDx)

# BABY DATASET
baby = read.csv("baby.csv", as.is=TRUE)
# MOSAIC PLOT
mosaicplot(table(baby[,3:5]))
# AGE HISTOGRAM
age_hist = ggplot(baby, aes(x = age))
age_hist + geom_histogram(bins=25) + ggtitle("Age")
mean(baby$age)
# WEIGHT HISTOGRAM
weight_hist = ggplot(baby, aes(x = weight))
weight_hist + geom_histogram(bins=25) + ggtitle("Weight")
mean(baby$weight)
# VISITS HISTOGRAM
visits_hist = ggplot(baby, aes(x = visits))
visits_hist + geom_histogram(bins=6) + ggtitle("Visits")
table(baby$visits)
# FITTING MODEL
full_model = glm(low ~ . , data = baby, family = binomial(link=logit))
empty_model = glm(low ~ 1 ,data = baby,family = binomial(link=logit))
# BACKWARD FORWARD
bf_aic = step(full_model, scope = list(lower = empty_model, upper = full_model),direction = "both", criterion = "AIC", trace = FALSE)
summary(bf_aic)$coefficients
# DX
influential_values = dx(bf_aic)

# STANDARDIZED RESIDUALS
stdresid = influential_values$sPr #Standardized residuals (Pearson)
for_hist = data.frame(stdresid)
hist_stdresid = ggplot(data = for_hist, aes(x = stdresid))
hist_stdresid + geom_histogram(bins=15) + ggtitle("Standardized Residuals")
cat("Mean: ", mean(stdresid), "\nVariance: ", var(stdresid))
# DFBETA
dfbeta = influential_values$dBhat #DF Beta for removing each observation
cutoff_beta = 0.20
values = dfbeta[dfbeta > cutoff_beta] #Shows the values
influential_values[dfbeta > cutoff_beta, c(1:6, 18)] #what observations they were
# PROPORTIONAL REDUCTION IN ERROR
prop_red = 1-sum((bf_aic$y-bf_aic$fitted.values)^2)/sum((bf_aic$y-mean(bf_aic$y))^2)
prop_red
# PREDICTIONS
predict_values = data.frame(age=29, weight=157, smoke="no", pre="no", hyp="no", visits=10)
predict(bf_aic, predict_values, type="response")
# ERROR MATRIX
fits = predict(bf_aic, type="response")
fits = ifelse(fits>0.5, 1, 0)
error_matrix = table(actual=baby$low, preds=fits)
error_matrix
# ERROR RATE
error_rate = 1 - sum(diag(error_matrix))/sum(error_matrix)
error_rate
View Home Page | View My GitHub Profile