Lab 3. Sample sizes and numbers of samples, oh my!

We've been studying study design and sampling methods. We've seen how bias comes into play. Now, imagine we have a great dataset that is representative of whatever population we are interested in.

free samples

Goals

  1. Understand sample from population
  2. Take estimate from a sampling distribution
  3. As $n$ increases, your sampling distribution becomes less variant and closer to the true value (see on a histogram)
In [2]:
library(readr)
library(dplyr)
In [3]:
birth_data <- read_csv("data/us-territory-births.csv")[,-1]
head(birth_data)
Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
  X1 = col_integer(),
  babyID = col_integer(),
  dbwt = col_integer(),
  combgest = col_integer(),
  sex = col_character(),
  dob_mm = col_integer(),
  cig_rec = col_character()
)
babyIDdbwtcombgestsexdob_mmcig_rec
1 297737 M 1 N
2 319141 M 1 Y
3 178632 F 1 N
4 448939 M 1 N
5 320338 M 1 N
6 320339 F 1 N

Data dictionary

This was provided from the data documentation.

Variable Description
babyID Unique identifier: row number
dbwt Birth weight in Grams: 0227-8165 Number of grams
combgest Combined Gestation - in Weeks: 17th to 47th week of Gestation
sex Sex of Infant: M (Male) or F (Female)
dob_mm Birth month of the infant
cig_rec If the mother reports smoking in any of the three trimesters of pregnancy she is classified as a smoker: (Y) Yes, (N) No, or (U) Unknown

For us, what is important right now is dbwt which is the birth weight. We're in the US, so we don't normally record babies' birth weight in grams.

In [4]:
install.packages("measurements")
library(measurements)
birth_data <- birth_data %>% mutate(bw_lbs = conv_unit(dbwt,"g","lbs"))
head(birth_data)
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
babyIDdbwtcombgestsexdob_mmcig_recbw_lbs
1 2977 37 M 1 N 6.563161
2 3191 41 M 1 Y 7.034950
3 1786 32 F 1 N 3.937455
4 4489 39 M 1 N 9.896550
5 3203 38 M 1 N 7.061405
6 3203 39 F 1 N 7.061405
In [5]:
birth_data <- birth_data %>% mutate(under_5lbs = bw_lbs < 5)
head(birth_data) # have a look at the added variable
babyIDdbwtcombgestsexdob_mmcig_recbw_lbsunder_5lbs
1 2977 37 M 1 N 6.563161FALSE
2 3191 41 M 1 Y 7.034950FALSE
3 1786 32 F 1 N 3.937455 TRUE
4 4489 39 M 1 N 9.896550FALSE
5 3203 38 M 1 N 7.061405FALSE
6 3203 39 F 1 N 7.061405FALSE

Let's first calculate the true mean. We are going to estimate the true mean using samples of different sizes to estimate this true population parameter.

In [14]:
birth_data %>% summarize(proportion=mean(under_5lbs))
proportion
0.05233635

Taking single samples from our dataset

Let's take a sample of size 10. Check out how the proportion in this case is 0! That is, a sample size of 10 could by chance show us that no babies were born under 5 lbs in our population of interest!

In [6]:
sample_10 <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
In [8]:
sample_10 %>% summarize(proportion=mean(under_5lbs))
proportion
0

Now, let's take a sample of size 100. The proportion is looking around 5%.

In [10]:
sample_100 <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
In [11]:
sample_100 %>% summarize(proportion=mean(under_5lbs))
proportion
0.05

Now, of size 1000, we see the proportion drifting a bit by chance away from the true mean.

In [12]:
sample_1000 <- birth_data %>% sample_n(size = 1000) %>% mutate(sample_size = n())
In [13]:
sample_1000 %>% summarize(proportion=mean(under_5lbs))
proportion
0.056

How well do these sample sizes estimate the true value?

We are now motivated to fill out the following table to figure out how the different combinations of sample size $n$ and number of samples taken changes our estimates. We have the true population proportion from above of babies who were born under 5 lbs.

Value Number
True Proportion 0.5233
10 samples of n=10
100 samples of n=10
10 samples of n=100
100 samples of n=100

Sample sizes of n=10

10 samples of n=10

In [64]:
set.seed(1007)
n_10_samples <- rep(0, 10)
for (ix in 1:10) {
    this_sample <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_10_samples[ix] <- this_proportion
}
n_10_samples <- data.frame(samples=as.numeric(matrix(n_10_samples)))
In [65]:
n_10_samples %>% summarize(estimated_proportion=mean(samples))
estimated_proportion
0.07

Now, we can update the table...

Value Number
True Proportion 0.5233
10 samples of n=10 0.070
100 samples of n=10
10 samples of n=100
100 samples of n=100

100 samples of n=10

Let's take a bunch of samples (100 samples total) of size 10.

In [66]:
set.seed(1007)
n_10_samples <- rep(0, 100)
for (ix in 1:100) {
    this_sample <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_10_samples[ix] <- this_proportion
}
n_10_samples <- data.frame(samples=as.numeric(matrix(n_10_samples)))
In [67]:
n_10_samples %>% summarize(estimated_proportion=mean(samples))
estimated_proportion
0.056

Now, we can update the table...

Value Number
True Proportion 0.5233
10 samples of n=10 0.070
100 samples of n=10 0.056
10 samples of n=100
100 samples of n=100

Sample sizes of n=100

10 samples of n=100

In [68]:
set.seed(1007)
n_100_samples <- rep(0, 10)
for (ix in 1:10) {
    this_sample <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_100_samples[ix] <- this_proportion
}
n_100_samples <- data.frame(samples=as.numeric(matrix(n_100_samples)))
In [69]:
n_100_samples %>% summarize(estimated_proportion=mean(samples))
estimated_proportion
0.058

Now, we can update the table...

Value Number
True Proportion 0.5233
10 samples of n=10 0.070
100 samples of n=10 0.056
10 samples of n=100 0.058
100 samples of n=100

100 samples of n=100

Let's take a bunch of samples (100 samples total) of size 100.

In [70]:
set.seed(1007)
n_100_samples <- rep(0, 100)
for (ix in 1:100) {
    this_sample <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_100_samples[ix] <- this_proportion
}
n_100_samples <- data.frame(samples=as.numeric(matrix(n_100_samples)))
In [71]:
n_100_samples %>% summarize(estimated_proportion=mean(samples))
estimated_proportion
0.0533

Now, we can update the table...

Value Number
True Proportion 0.5233
10 samples of n=10 0.070
100 samples of n=10 0.056
10 samples of n=100 0.058
100 samples of n=100 0.0533

Do you see how increasing the number of samples will get better estimates and how furthermore increasing the sample size will also get better estimates? Try searching for more patterns on your own.