Lab 3. Sample sizes and numbers of samples, oh my!¶

We've been studying study design and sampling methods. We've seen how bias comes into play. Now, imagine we have a great dataset that is representative of whatever population we are interested in.

free samples

Goals¶

Understand sample from population
Take estimate from a sampling distribution
As $n$ increases, your sampling distribution becomes less variant and closer to the true value (see on a histogram)

library(readr)
library(dplyr)

birth_data <- read_csv("data/us-territory-births.csv")[,-1]
head(birth_data)

Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
  X1 = col_integer(),
  babyID = col_integer(),
  dbwt = col_integer(),
  combgest = col_integer(),
  sex = col_character(),
  dob_mm = col_integer(),
  cig_rec = col_character()
)

Data dictionary¶

This was provided from the data documentation.

Variable	Description
babyID	Unique identifier: row number
dbwt	Birth weight in Grams: 0227-8165 Number of grams
combgest	Combined Gestation - in Weeks: 17th to 47th week of Gestation
sex	Sex of Infant: M (Male) or F (Female)
dob_mm	Birth month of the infant
cig_rec	If the mother reports smoking in any of the three trimesters of pregnancy she is classified as a smoker: (Y) Yes, (N) No, or (U) Unknown

For us, what is important right now is dbwt which is the birth weight. We're in the US, so we don't normally record babies' birth weight in grams.

install.packages("measurements")
library(measurements)
birth_data <- birth_data %>% mutate(bw_lbs = conv_unit(dbwt,"g","lbs"))
head(birth_data)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

birth_data <- birth_data %>% mutate(under_5lbs = bw_lbs < 5)
head(birth_data) # have a look at the added variable

Let's first calculate the true mean. We are going to estimate the true mean using samples of different sizes to estimate this true population parameter.

birth_data %>% summarize(proportion=mean(under_5lbs))

Taking single samples from our dataset¶

Let's take a sample of size 10. Check out how the proportion in this case is 0! That is, a sample size of 10 could by chance show us that no babies were born under 5 lbs in our population of interest!

sample_10 <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())

sample_10 %>% summarize(proportion=mean(under_5lbs))

Now, let's take a sample of size 100. The proportion is looking around 5%.

sample_100 <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())

sample_100 %>% summarize(proportion=mean(under_5lbs))

Now, of size 1000, we see the proportion drifting a bit by chance away from the true mean.

sample_1000 <- birth_data %>% sample_n(size = 1000) %>% mutate(sample_size = n())

sample_1000 %>% summarize(proportion=mean(under_5lbs))

How well do these sample sizes estimate the true value?¶

We are now motivated to fill out the following table to figure out how the different combinations of sample size $n$ and number of samples taken changes our estimates. We have the true population proportion from above of babies who were born under 5 lbs.

Value	Number
True Proportion	0.5233
10 samples of n=10
100 samples of n=10
10 samples of n=100
100 samples of n=100

Sample sizes of n=10¶

10 samples of n=10¶

set.seed(1007)
n_10_samples <- rep(0, 10)
for (ix in 1:10) {
    this_sample <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_10_samples[ix] <- this_proportion
}
n_10_samples <- data.frame(samples=as.numeric(matrix(n_10_samples)))

n_10_samples %>% summarize(estimated_proportion=mean(samples))

Now, we can update the table...

Value	Number
True Proportion	0.5233
10 samples of n=10	0.070
100 samples of n=10
10 samples of n=100
100 samples of n=100

100 samples of n=10¶

Let's take a bunch of samples (100 samples total) of size 10.

set.seed(1007)
n_10_samples <- rep(0, 100)
for (ix in 1:100) {
    this_sample <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_10_samples[ix] <- this_proportion
}
n_10_samples <- data.frame(samples=as.numeric(matrix(n_10_samples)))

n_10_samples %>% summarize(estimated_proportion=mean(samples))

Now, we can update the table...

Value	Number
True Proportion	0.5233
10 samples of n=10	0.070
100 samples of n=10	0.056
10 samples of n=100
100 samples of n=100

Sample sizes of n=100¶

10 samples of n=100¶

set.seed(1007)
n_100_samples <- rep(0, 10)
for (ix in 1:10) {
    this_sample <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_100_samples[ix] <- this_proportion
}
n_100_samples <- data.frame(samples=as.numeric(matrix(n_100_samples)))

n_100_samples %>% summarize(estimated_proportion=mean(samples))

Now, we can update the table...

Value	Number
True Proportion	0.5233
10 samples of n=10	0.070
100 samples of n=10	0.056
10 samples of n=100	0.058
100 samples of n=100

100 samples of n=100¶

Let's take a bunch of samples (100 samples total) of size 100.

set.seed(1007)
n_100_samples <- rep(0, 100)
for (ix in 1:100) {
    this_sample <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())
    this_proportion <- this_sample %>% summarize(proportion=mean(under_5lbs))
    n_100_samples[ix] <- this_proportion
}
n_100_samples <- data.frame(samples=as.numeric(matrix(n_100_samples)))

n_100_samples %>% summarize(estimated_proportion=mean(samples))

Now, we can update the table...

Value	Number
True Proportion	0.5233
10 samples of n=10	0.070
100 samples of n=10	0.056
10 samples of n=100	0.058
100 samples of n=100	0.0533

Do you see how increasing the number of samples will get better estimates and how furthermore increasing the sample size will also get better estimates? Try searching for more patterns on your own.

babyID	dbwt	combgest	sex	dob_mm	cig_rec
1	2977	37	M	1	N
2	3191	41	M	1	Y
3	1786	32	F	1	N
4	4489	39	M	1	N
5	3203	38	M	1	N
6	3203	39	F	1	N

babyID	dbwt	combgest	sex	dob_mm	cig_rec	bw_lbs
1	2977	37	M	1	N	6.563161
2	3191	41	M	1	Y	7.034950
3	1786	32	F	1	N	3.937455
4	4489	39	M	1	N	9.896550
5	3203	38	M	1	N	7.061405
6	3203	39	F	1	N	7.061405

babyID	dbwt	combgest	sex	dob_mm	cig_rec	bw_lbs	under_5lbs
1	2977	37	M	1	N	6.563161	FALSE
2	3191	41	M	1	Y	7.034950	FALSE
3	1786	32	F	1	N	3.937455	TRUE
4	4489	39	M	1	N	9.896550	FALSE
5	3203	38	M	1	N	7.061405	FALSE
6	3203	39	F	1	N	7.061405	FALSE