We've been studying study design and sampling methods. We've seen how bias comes into play. Now, imagine we have a great dataset that is representative of whatever population we are interested in.
library(readr)
library(dplyr)
birth_data <- read_csv("data/us-territory-births.csv")[,-1]
head(birth_data)
This was provided from the data documentation.
Variable | Description |
---|---|
babyID | Unique identifier: row number |
dbwt | Birth weight in Grams: 0227-8165 Number of grams |
combgest | Combined Gestation - in Weeks: 17th to 47th week of Gestation |
sex | Sex of Infant: M (Male) or F (Female) |
dob_mm | Birth month of the infant |
cig_rec | If the mother reports smoking in any of the three trimesters of pregnancy she is classified as a smoker: (Y) Yes, (N) No, or (U) Unknown |
For us, what is important right now is dbwt
which is the birth weight. We're in the US, so we don't normally record babies' birth weight in grams.
install.packages("measurements")
library(measurements)
birth_data <- birth_data %>% mutate(bw_lbs = conv_unit(dbwt,"g","lbs"))
head(birth_data)
birth_data <- birth_data %>% mutate(under_5lbs = bw_lbs < 5)
head(birth_data) # have a look at the added variable
Let's first calculate the true mean. We are going to estimate the true mean using samples of different sizes to estimate this true population parameter.
birth_data %>% summarize(proportion=mean(under_5lbs))
Let's take a sample of size 10. Check out how the proportion in this case is 0! That is, a sample size of 10 could by chance show us that no babies were born under 5 lbs in our population of interest!
sample_10 <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
sample_10 %>% summarize(proportion=mean(under_5lbs))
Now, let's take a sample of size 100. The proportion is looking around 5%.
sample_100 <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
sample_100 %>% summarize(proportion=mean(under_5lbs))
Now, of size 1000, we see the proportion drifting a bit by chance away from the true mean.
sample_1000 <- birth_data %>% sample_n(size = 1000) %>% mutate(sample_size = n())
sample_1000 %>% summarize(proportion=mean(under_5lbs))
We are now motivated to fill out the following table to figure out how the different combinations of sample size $n$ and number of samples taken changes our estimates. We have the true population proportion from above of babies who were born under 5 lbs.
Value | Number |
---|---|
True Proportion | 0.5233 |
10 samples of n=10 | |
100 samples of n=10 | |
10 samples of n=100 | |
100 samples of n=100 |
set.seed(1007)
n_10_samples <- rep(0, 10)
for (ix in 1:10) {
this_sample <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
n_10_samples[ix] <- this_proportion
}
n_10_samples <- data.frame(samples=as.numeric(matrix(n_10_samples)))
n_10_samples %>% summarize(estimated_proportion=mean(samples))
Now, we can update the table...
Value | Number |
---|---|
True Proportion | 0.5233 |
10 samples of n=10 | 0.070 |
100 samples of n=10 | |
10 samples of n=100 | |
100 samples of n=100 |
Let's take a bunch of samples (100 samples total) of size 10.
set.seed(1007)
n_10_samples <- rep(0, 100)
for (ix in 1:100) {
this_sample <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
n_10_samples[ix] <- this_proportion
}
n_10_samples <- data.frame(samples=as.numeric(matrix(n_10_samples)))
n_10_samples %>% summarize(estimated_proportion=mean(samples))
Now, we can update the table...
Value | Number |
---|---|
True Proportion | 0.5233 |
10 samples of n=10 | 0.070 |
100 samples of n=10 | 0.056 |
10 samples of n=100 | |
100 samples of n=100 |
set.seed(1007)
n_100_samples <- rep(0, 10)
for (ix in 1:10) {
this_sample <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
n_100_samples[ix] <- this_proportion
}
n_100_samples <- data.frame(samples=as.numeric(matrix(n_100_samples)))
n_100_samples %>% summarize(estimated_proportion=mean(samples))
Now, we can update the table...
Value | Number |
---|---|
True Proportion | 0.5233 |
10 samples of n=10 | 0.070 |
100 samples of n=10 | 0.056 |
10 samples of n=100 | 0.058 |
100 samples of n=100 |
Let's take a bunch of samples (100 samples total) of size 100.
set.seed(1007)
n_100_samples <- rep(0, 100)
for (ix in 1:100) {
this_sample <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
n_100_samples[ix] <- this_proportion
}
n_100_samples <- data.frame(samples=as.numeric(matrix(n_100_samples)))
n_100_samples %>% summarize(estimated_proportion=mean(samples))
Now, we can update the table...
Value | Number |
---|---|
True Proportion | 0.5233 |
10 samples of n=10 | 0.070 |
100 samples of n=10 | 0.056 |
10 samples of n=100 | 0.058 |
100 samples of n=100 | 0.0533 |
Do you see how increasing the number of samples will get better estimates and how furthermore increasing the sample size will also get better estimates? Try searching for more patterns on your own.