Getting population parameters

First, we load in libraries as normal.

library(ggplot2)
library(dplyr)
library(readr)
library(knitr)

Let’s check out the census data we have on Alameda.

alameda <- read.csv("../data/alameda.csv")
head(alameda)
##   birth_loc num_sibs hosp_visit   height
## 1         0        1          0 72.82902
## 2         0        7          0 69.98090
## 3         0        2          0 69.34478
## 4         1        0          0 75.50710
## 5         0        0          0 67.88378
## 6         0        2          0 69.55474

Notice how right-skewed the following true distribution is.

# * CALCULATE TRUE MEAN SIBS
mean_sibs <- alameda %>% summarize(mean_sibs=mean(num_sibs)) %>% pull(mean_sibs)

# * VISUALIZE TRUE DISTRIBUTION OF NUMBER OF SIBLINGS (BLUE LINE = TRUE MEAN)
ggplot(data=alameda, aes(x=num_sibs)) +
  geom_bar(col="white", lwd=0.2) + 
  geom_vline(xintercept=mean_sibs, col="royalblue") +
  ggtitle("True distribution of number of siblings")

Now, we’re going to make sampling distributions. Do not confuse these with the distribution of a sample! For a sampling distribution, we compute a statistic (here, the mean) from each of many samples, then plot those statistics on a histogram. We are not taking one sample and making a histogram of its raw values! The two shapes can be entirely different.
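
To make the distinction concrete, here is a minimal sketch. It assumes a toy stand-in population (`toy_pop`, Poisson counts with mean 1.18) rather than the real Alameda file, and the names `one_sample` and `sample_means` are made up for illustration.

```r
library(dplyr)

set.seed(1)  # fixed seed only so the sketch is reproducible

# ASSUMPTION: toy stand-in population, not the real census data.
# Poisson counts with mean 1.18 mimic a right-skewed sibling count.
toy_pop <- data.frame(num_sibs = rpois(10000, lambda = 1.18))

# Distribution of ONE sample: the 50 raw sibling counts themselves.
one_sample <- toy_pop %>% sample_n(50) %>% pull(num_sibs)

# Sampling distribution: one mean per sample, repeated for 100 samples.
sample_means <- replicate(
  100,
  toy_pop %>% sample_n(50) %>% summarize(m = mean(num_sibs)) %>% pull(m)
)

length(one_sample)    # number of raw values in the single sample
length(sample_means)  # number of sample means collected
```

A histogram of `one_sample` shows whole-number sibling counts, while a histogram of `sample_means` shows averages clustered around the population mean.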

Calculating sample statistics

To get a sample of size 5 from the Alameda data, we use sample_n().

sample_5 <- alameda %>% sample_n(5)
sample_5
##      birth_loc num_sibs hosp_visit   height
## 3553         0        0          0 65.57817
## 9791         1        2          0 69.04170
## 1585         0        2          0 70.01666
## 1121         0        0          0 71.78479
## 7450         1        1         20 68.17404
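
Because sample_n() draws rows at random, your five rows will almost certainly differ from the ones shown here. If you want the same draw every time, one option is to call set.seed() first. The sketch below uses a small made-up data frame (`toy_pop`) and an arbitrary seed value, both of which are assumptions for illustration only.

```r
library(dplyr)

# ASSUMPTION: tiny made-up data frame standing in for the Alameda data.
toy_pop <- data.frame(id = 1:100, num_sibs = rep(0:4, times = 20))

# Setting the same seed before each draw reproduces the same sample.
set.seed(42)  # 42 is arbitrary
draw_a <- toy_pop %>% sample_n(5)

set.seed(42)
draw_b <- toy_pop %>% sample_n(5)

identical(draw_a$id, draw_b$id)  # compares which rows were drawn
```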

Then, we compute the sample statistic: the mean number of siblings in this sample. This value may not match the true population mean of 1.18 siblings (the mean_sibs value computed above).

sample_5 %>% summarize(mean_sibs=mean(num_sibs))
##   mean_sibs
## 1         1

Now, we’re going to show you how the sampling distribution of the sample mean \(\bar{X}_n\) changes when we take many, many samples of size \(n \in \{5, 50, 500\}\). We will take a total of 100 samples for each sample size. As you look at the following GIFs (is it jiff or ghiff? :P), think about the following questions:

  1. How does the variance change when we increase the sample size?
  2. How does the distributional shape change when the number of samples increases?
  3. How does the distributional shape change when we increase the sample size?
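
If you would like to explore these questions numerically alongside the animations, the following is a rough sketch of the same experiment. It assumes a toy population (`toy_sibs`) instead of the real census column, and `sim_means()` is a hypothetical helper written for this example, not a function from any package.

```r
set.seed(2)  # fixed seed only so this sketch is reproducible

# ASSUMPTION: toy stand-in for the census column alameda$num_sibs.
toy_sibs <- rpois(10000, lambda = 1.18)

# Hypothetical helper: draw `reps` samples of size `n` (without
# replacement) and return the mean of each sample.
sim_means <- function(pop, n, reps = 100) {
  replicate(reps, mean(sample(pop, size = n)))
}

sizes <- c(5, 50, 500)
means_by_n <- lapply(sizes, function(n) sim_means(toy_sibs, n))

# One row per sample size: centre and spread of the 100 sample means.
summary_by_n <- data.frame(
  n             = sizes,
  mean_of_means = sapply(means_by_n, mean),
  var_of_means  = sapply(means_by_n, var)
)
summary_by_n
```

Comparing the `var_of_means` column across rows is one way to check your answer to the first question; changing `reps` lets you probe the second.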

Sampling distribution with n=5

Check out the animation for \(n=5\). Notice that the spread (variance) is quite large, so our individual estimates aren’t so awesome, but the mean of the sample means looks close to the true mean number of siblings, which is 1.18.

Sampling distribution with n=50

The animation for \(n=50\) looks like this.

Sampling distribution with n=500

The animation for \(n=500\) looks like this.

To get a clearer look at the approximately normal shape, we can make the binwidth smaller.
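
As a sketch of what a smaller binwidth looks like in code: the example below again assumes a toy population (`toy_sibs`), and the binwidth of 0.01 is an illustrative guess rather than the value used in the original animation.

```r
library(ggplot2)

set.seed(3)  # fixed seed only so this sketch is reproducible

# ASSUMPTION: toy stand-in for the census column alameda$num_sibs.
toy_sibs <- rpois(10000, lambda = 1.18)

# 100 sample means, each from a sample of size 500.
means_500 <- data.frame(
  xbar = replicate(100, mean(sample(toy_sibs, size = 500)))
)

# A narrow binwidth resolves the shape of the tightly clustered means.
p <- ggplot(data = means_500, aes(x = xbar)) +
  geom_histogram(binwidth = 0.01, col = "white") +
  geom_vline(xintercept = mean(toy_sibs), col = "royalblue") +
  ggtitle("Sampling distribution of the sample mean, n = 500 (toy data)")
p
```

Try a larger value such as `binwidth = 0.1` to see how most of the means collapse into one or two bars.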