First, we load in libraries as normal.
library(ggplot2)
library(dplyr)
library(readr)
library(knitr)
Let’s check out the census data we have on Alameda.
alameda <- read.csv("../data/alameda.csv")
head(alameda)
## birth_loc num_sibs hosp_visit height
## 1 0 1 0 72.82902
## 2 0 7 0 69.98090
## 3 0 2 0 69.34478
## 4 1 0 0 75.50710
## 5 0 0 0 67.88378
## 6 0 2 0 69.55474
Notice how right-skewed the following true distribution is.
# * CALCULATE TRUE MEAN SIBS
mean_sibs <- alameda %>% summarize(mean_sibs=mean(num_sibs)) %>% pull(mean_sibs)
# * VISUALIZE TRUE DISTRIBUTION OF NUMBER OF SIBLINGS
ggplot(data=alameda, aes(x=num_sibs)) +
geom_bar(col="white", lwd=0.2) +
geom_vline(xintercept=mean_sibs, col="royalblue") +
ggtitle("True distribution of number of siblings")
Now, we’re going to make sampling distributions. Do not confuse these with distributions of samples! For a sampling distribution, we take many samples, compute a statistic (here, the mean) from each one, and plot those statistics on a histogram. We are not taking a single sample and making a histogram of its values! The shapes would be entirely different.
To get a sample of size 5 from the Alameda data, we use sample_n().
sample_5 <- alameda %>% sample_n(5)
sample_5
## birth_loc num_sibs hosp_visit height
## 3553 0 0 0 65.57817
## 9791 1 2 0 69.04170
## 1585 0 2 0 70.01666
## 1121 0 0 0 71.78479
## 7450 1 1 20 68.17404
Then, we compute the sample statistic: the mean number of siblings. Because it comes from only 5 people, this value may not reflect the true population mean of 1.18 siblings, which is the value of mean_sibs computed above.
sample_5 %>% summarize(mean_sibs=mean(num_sibs))
## mean_sibs
## 1 1
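One sample gives one mean; a sampling distribution needs many of them. The sketch below repeats the draw 100 times and keeps only the mean of each sample. It is an illustration under stated assumptions, not the code behind this page: `num_sibs` is a simulated, hypothetical stand-in for the `num_sibs` column (so it runs without the CSV), and base R's `sample()` is used in place of `sample_n()`.

```r
# Hypothetical stand-in for alameda$num_sibs: a right-skewed count variable.
set.seed(1)
num_sibs <- rpois(10000, lambda = 1.18)

# Draw 100 samples of size 5 and keep only the mean of each sample.
sample_means_5 <- replicate(100, mean(sample(num_sibs, size = 5)))

# One value per sample: these 100 means make up the sampling distribution.
length(sample_means_5)
mean(sample_means_5)  # centre of the sampling distribution
mean(num_sibs)        # population mean it is trying to estimate
```

With the real data, the same pattern should work by replacing the simulated vector with `alameda$num_sibs`.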
Now, we’re going to show you how the sampling distribution of the sample mean \(\bar{X}_n\) changes as we take many, many samples of size \(n \in \{5, 50, 500\}\). We will take a total of 100 samples for each sample size. As you look at the following GIFs (is it jiff or ghiff? :P), think about the following questions:
Check out the animation for \(n=5\). Notice that we have quite the spread/variance, and our estimates aren’t so awesome, but the overall mean looks like it is close to the true mean number of siblings which is 1.18.
The animation for \(n=50\) looks like this.
The animation for \(n=500\) looks like this.
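To put numbers on what the three animations show, here is a minimal sketch that measures the spread of 100 sample means at each sample size. It again assumes a simulated, hypothetical stand-in for `num_sibs` rather than the real Alameda file. The last column is the usual approximation \(\sigma/\sqrt{n}\) for the standard deviation of \(\bar{X}_n\), included only for comparison.

```r
set.seed(1)
num_sibs <- rpois(10000, lambda = 1.18)  # hypothetical stand-in population

sample_sizes <- c(5, 50, 500)

# For each n, draw 100 samples and record the standard deviation
# of the resulting 100 sample means.
spread <- sapply(sample_sizes, function(n) {
  sd(replicate(100, mean(sample(num_sibs, size = n))))
})

data.frame(
  n = sample_sizes,
  sd_of_sample_means = spread,
  sigma_over_sqrt_n = sd(num_sibs) / sqrt(sample_sizes)
)
```

The spread should shrink as \(n\) grows, which is why the histogram in each animation gets narrower.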
To have a clearer look at the normal distribution, we can make the binwidth smaller.
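As a rough sketch of what a smaller binwidth looks like in ggplot2, the code below plots 100 sample means for \(n=500\). The data are a simulated, hypothetical stand-in, and the value `binwidth = 0.02` is an illustrative choice, not necessarily the one used for the figures on this page.

```r
library(ggplot2)

set.seed(1)
num_sibs <- rpois(10000, lambda = 1.18)  # hypothetical stand-in population

# 100 sample means, each from a sample of size 500.
sample_means <- data.frame(
  mean_sibs = replicate(100, mean(sample(num_sibs, size = 500)))
)

# binwidth is in the units of the x axis (siblings).
p <- ggplot(sample_means, aes(x = mean_sibs)) +
  geom_histogram(binwidth = 0.02, col = "white") +
  geom_vline(xintercept = mean(num_sibs), col = "royalblue") +
  ggtitle("Sampling distribution of the sample mean, n = 500")
p
```

Narrower bins show more of the shape but get noisy when there are only 100 means, so it is worth trying a few values.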