The permutation test is based on simulating a null distribution. From that null distribution, we calculate a p-value.
Say the hypotheses we wish to test are \[ H_0: \mu_1=\mu_2 \text{ versus } H_1: \mu_1 \neq \mu_2 \]
for some populations 1 and 2.
Your null distribution is governed by your null hypothesis. Our null says, “Mean 1 equals mean 2”. So, we say, “Yes, ma’am, Ms. Null. Mean 1 will equal mean 2 in our simulation.”
For a given dataset, if the means of the two distributions were the same, then any observation we have could have come from either population 1 or population 2. Imagine you have two datasets pulled from \(N(0, 0.5)\) distributions (mean 0, standard deviation 0.5), i.e. we know that \(\mu_1=\mu_2\).
We are going to walk through the steps of a permutation test, expecting to fail to reject the null hypothesis.
set.seed(142)
sample_1 <- rnorm(n=10, mean=0, sd=0.5)
sample_2 <- rnorm(n=10, mean=0, sd=0.5)
samples <- data.frame(cbind(sample_1, sample_2))
samples
## sample_1 sample_2
## 1 0.62736043 0.73789956
## 2 0.85204271 -0.81809309
## 3 0.44016151 0.64780971
## 4 -0.29077247 -0.06890457
## 5 -0.08459428 0.24827375
## 6 0.53238943 0.17254728
## 7 0.35918774 0.21900323
## 8 0.01912117 0.06949258
## 9 -0.52483197 -0.68411150
## 10 0.10615675 0.29494909
Let’s melt the dataframe into long format: instead of two columns, one per sample, we will have one column denoting which sample each value came from and a second column holding the value itself.
library(reshape2)
long_samples <- melt(samples, id.vars=NULL)
library(dplyr)
long_samples %>% sample_n(10)
## variable value
## 1 sample_1 0.85204271
## 2 sample_1 -0.08459428
## 3 sample_1 0.44016151
## 4 sample_1 0.53238943
## 5 sample_2 0.21900323
## 6 sample_1 0.10615675
## 7 sample_2 0.64780971
## 8 sample_2 0.06949258
## 9 sample_1 0.01912117
## 10 sample_2 -0.81809309
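(As an aside, reshape2 has been superseded these days; if you prefer the tidyverse, tidyr::pivot_longer produces the same long format. A minimal sketch, with `long_samples_alt` as an illustrative name; note the `variable` column comes back as character rather than factor.)
library(tidyr)
# Equivalent reshape with tidyr; names_to/values_to chosen to match melt's output
long_samples_alt <- samples %>%
  pivot_longer(everything(), names_to="variable", values_to="value")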
Let’s visualize this now.
library(ggplot2)
ggplot(long_samples, aes(x=variable, y=value, fill=variable)) +
geom_boxplot(lwd=0.4)
ggplot(long_samples, aes(x=value, fill=variable)) +
geom_histogram(binwidth=0.1, col="black", lwd=0.2)
ggplot(long_samples, aes(x=value, fill=variable)) +
geom_histogram(binwidth=0.1, col="black", lwd=0.2) +
facet_wrap(~variable)
Both samples are centered close to 0, but sample 2 has a larger variance. This is not due to the underlying distributions; both were simulated from \(N(0,0.5)\). It is due to sampling variability (the sample size is only 10 for each). From our dataset, we see these statistics.
long_samples %>% group_by(variable) %>% summarize(x_bar=mean(value), s=sd(value))
## # A tibble: 2 x 3
## variable x_bar s
## <fct> <dbl> <dbl>
## 1 sample_1 0.204 0.433
## 2 sample_2 0.0819 0.502
Typically, in order to test something like the above, we’d use a two sample t-test to compare the two means.
Recall, for comparison as we continue, that for a t-test we would have:
\[ T= \frac{\bar{x}_1-\bar{x}_2}{\widehat{SE}(\bar{x}_1-\bar{x}_2)} \]
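To make that concrete, here is a minimal sketch computing \(T\) by hand for our samples, using the Welch (unpooled) standard error (the names `x1`, `x2`, `se_hat`, and `T_stat` are mine, introduced only for illustration).
# Welch (unpooled) standard error of the difference in sample means
x1 <- long_samples %>% filter(variable=="sample_1") %>% pull(value)
x2 <- long_samples %>% filter(variable=="sample_2") %>% pull(value)
se_hat <- sqrt(var(x1)/length(x1) + var(x2)/length(x2))
# The statistic: difference in sample means over its estimated standard error
T_stat <- (mean(x1)-mean(x2))/se_hat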
We’ll take this time now to calculate our observed mean difference.
means <- long_samples %>% group_by(variable) %>% summarize(mean(value)) %>% pull()
obs_diff <- means[2]-means[1]
obs_diff
## [1] -0.1217355
Because we have a small sample size, our \(\bar{x}\) is expected to be imprecise for both of our samples. (And they are; 0.204 and 0.0819 are not exactly the \(\mu=0\) we set to be true when simulating the data.)
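To see just how imprecise, here is a quick sketch (my addition) of the estimated standard errors of the sample means, \(s/\sqrt{n}\).
# Standard errors of the sample means; with n=10 these are sizable relative
# to the observed means, so estimates this far from 0 are unsurprising
long_samples %>% group_by(variable) %>% summarize(se_x_bar=sd(value)/sqrt(n()))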
If it is true that \(\mu_1=\mu_2\), then we can shuffle our labels on our data. I am going to use this function to generate new labels.
# * FUNCTION TO CREATE LABELS
generate_labels <- function(labels, repetitions) {
  # * EACH COLUMN IS ONE RANDOM SHUFFLE (PERMUTATION) OF THE ORIGINAL LABELS
  relabels <- do.call(cbind,
                      lapply(1:repetitions, function(x)
                        sample(as.character(labels),
                               length(labels),
                               replace=FALSE)))
  relabels <- data.frame(data=relabels)
  names(relabels) <- paste0("relabel_", 1:repetitions)
  relabels
}
# * APPLY FUNCTION
relabels <- generate_labels(long_samples %>% pull(variable), 1000)
head(relabels[,1:6])
## relabel_1 relabel_2 relabel_3 relabel_4 relabel_5 relabel_6
## 1 sample_1 sample_1 sample_2 sample_2 sample_2 sample_1
## 2 sample_2 sample_2 sample_1 sample_2 sample_1 sample_1
## 3 sample_1 sample_2 sample_1 sample_2 sample_1 sample_1
## 4 sample_1 sample_1 sample_2 sample_1 sample_1 sample_2
## 5 sample_2 sample_2 sample_2 sample_1 sample_1 sample_2
## 6 sample_1 sample_1 sample_1 sample_2 sample_2 sample_1
How does this help? Well, now we can build a null distribution. For each relabeling, we take the two sample means and subtract one from the other because…
\[ H_0: \mu_1=\mu_2 \implies H_0: \mu_2-\mu_1=0 \]
Remember, if our null were true, then we would see the distribution of these differences center around 0. We’ll use the function below.
# * FUNCTION TO CALCULATE SAMPLE MEAN DIFFERENCES
calculate_sample_mean_diffs <- function(value_col, relabel_df) {
  # * CREATE A "NOTE TAKER"
  calculated_diffs <- c()
  # * FOR EACH OF THE RELABELS
  for (ix in 1:ncol(relabel_df)) {
    # * CALCULATE THE DIFFERENCE IN MEANS
    # * (data.frame, NOT cbind, SO THE NUMERIC VALUES ARE NOT COERCED TO TEXT)
    tmp_df <- data.frame(value=value_col, relabel=relabel_df[,ix])
    means <- tmp_df %>% group_by(relabel) %>% summarize(mean(value)) %>% pull()
    calculated_diffs <- c(calculated_diffs, means[2]-means[1])
  }
  calculated_diffs
}
# * APPLYING THE FUNCTION
calculated_diffs <- calculate_sample_mean_diffs(long_samples %>% pull(value), relabels)
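As an aside, the same differences can be computed more compactly without an explicit loop; a sketch under the same inputs (`calculated_diffs_alt` is an illustrative name).
# For each relabel column, tapply computes the two group means (in sorted label
# order) and diff takes means[2]-means[1], matching the loop above
calculated_diffs_alt <- vapply(relabels, function(lab)
  diff(tapply(long_samples$value, lab, mean)), numeric(1))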
So here’s our simulated null distribution. Normally, we would calculate some test statistic and compare it to a known distribution (normal, Student’s t, F, or chi-squared) to derive a p-value. This time, we’re going to use this simulated distribution to get a p-value.
ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) +
geom_histogram(binwidth=0.05, fill="cornflowerblue", col="black", lwd=0.2) +
geom_vline(xintercept=obs_diff, lwd=1.5)
Our black line is our observed difference. Since our alternative is two-sided and our observed difference is negative, we will take 2 times the fraction of the simulated differences that lie at or below it to be our p-value (had it been positive, we would use the fraction at or above). We’ll use a function to formalize this as well.
# * FUNCTION TO CALCULATE P-VALUE
calculate_pvalue <- function(observed, simulated) {
  if (observed < 0) {
    # * TWICE THE FRACTION OF SIMULATED DIFFERENCES AT OR BELOW THE OBSERVED
    p_value <- 2*(length(which(simulated<=observed))/length(simulated))
  } else {
    # * TWICE THE FRACTION OF SIMULATED DIFFERENCES AT OR ABOVE THE OBSERVED
    p_value <- 2*(length(which(simulated>=observed))/length(simulated))
  }
  p_value
}
# * APPLYING THE FUNCTION
calculate_pvalue(obs_diff, calculated_diffs)
## [1] 0.538
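As a side note, because the simulated null distribution is roughly symmetric about 0 here, an essentially equivalent one-liner (my addition) counts both tails at once, so no factor of 2 is needed.
# Fraction of permuted differences at least as extreme (in absolute value)
# as the observed difference; both tails are counted, so no doubling
mean(abs(calculated_diffs) >= abs(obs_diff))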
With a p-value like the one above, we fail to reject \(H_0\). We’ve essentially shown that when the two distributions really are the same, the permutation test behaves the way it should.
Now, how would this work on real data? One thing will definitely change for us in this setting.
We do not in fact know that the means of the two groups are equal to each other. We only knew that earlier because we made them equal through simulation.
Say that I was interested in testing
\[ H_0: \mu_{\text{(Danceability|Above mean tempo)}} = \mu_{\text{(Danceability|At or below mean tempo)}} \] versus
\[ H_1: \mu_{\text{(Danceability|Above mean tempo)}} \neq \mu_{\text{(Danceability|At or below mean tempo)}} \]
for the following dataset
spotify_subset <- spotify_subset %>% mutate(tempo=ifelse(tempo>mean(tempo), "above", "below"))
head(spotify_subset)
## # A tibble: 6 x 4
## artist song_title danceability tempo
## <chr> <chr> <dbl> <chr>
## 1 Tori Kelly First Heartbreak 0.657 above
## 2 J. Cole No Role Modelz 0.696 below
## 3 Queen Don't Stop Me Now - Remastered 2011 0.559 above
## 4 Fall Out Boy Centuries 0.385 above
## 5 Martin Garrix Poison 0.502 above
## 6 Backstreet Boys I Want It That Way 0.683 below
Let’s see how many fall above and below the mean tempo.
spotify_subset %>% group_by(tempo) %>% summarize(n())
## # A tibble: 2 x 2
## tempo `n()`
## <chr> <int>
## 1 above 13
## 2 below 15
Let’s take a look at how danceability behaves for tempos above and below (or equal to) the mean tempo.
ggplot(spotify_subset, aes(y=danceability, x=tempo, fill=tempo)) +
geom_boxplot(lwd=0.4)
ggplot(spotify_subset, aes(x=danceability, fill=tempo)) +
geom_histogram(binwidth=0.03, col="black", lwd=0.4)
ggplot(spotify_subset, aes(x=danceability, fill=tempo)) +
geom_histogram(binwidth=0.03, col="black", lwd=0.4) +
facet_wrap(~tempo)
What is our observed difference in means?
means <- spotify_subset %>% group_by(tempo) %>% summarize(mean(danceability)) %>% pull()
obs_diff <- means[2]-means[1]
obs_diff
## [1] 0.003923077
It’s pretty small! And like earlier, under the assumption that the means are the same, we can reshuffle the labels. So, let’s use our function to generate labels.
spotify_relabels <- generate_labels(spotify_subset %>% pull(tempo), 1000)
head(spotify_relabels[,1:6])
## relabel_1 relabel_2 relabel_3 relabel_4 relabel_5 relabel_6
## 1 below below below below above below
## 2 above below below above below below
## 3 above above below below below above
## 4 above above above below below above
## 5 above above above above above below
## 6 above below above below below above
And then calculate the difference in sample means given all of these relabels. We will show these in a histogram with a vertical line representing our observed difference in means.
calculated_diffs <- calculate_sample_mean_diffs(spotify_subset %>% pull(danceability), spotify_relabels)
ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) +
geom_histogram(binwidth=0.01, fill="cornflowerblue", col="black", lwd=0.2) +
geom_vline(xintercept=obs_diff, lwd=1.5)
And here, we’ll calculate our p-value.
calculate_pvalue(obs_diff, calculated_diffs)
## [1] 0.882
For this example as well, we fail to reject \(H_0\); we have no evidence that mean danceability differs between the two tempo groups.
Since we have now failed to reject twice, I’ll show an example where we will reject \(H_0\), so you can see that case at least once. I will not comment as much below; we are testing the same hypotheses and using the same functions.
set.seed(142)
sample_1 <- runif(n=10, min=-2, max=0)
sample_2 <- runif(n=10, min=2, max=4)
samples <- data.frame(cbind(sample_1, sample_2))
samples
## sample_1 sample_2
## 1 -0.20958012 3.713024
## 2 -0.60193654 3.762831
## 3 -0.08836513 3.527474
## 4 -0.87699277 3.121733
## 5 -0.37868435 3.030506
## 6 -0.26440232 3.112093
## 7 -1.43912677 2.293873
## 8 -0.63349114 3.072209
## 9 -1.13435167 3.168138
## 10 -1.13546463 2.353247
Putting data in long format for ease.
long_samples <- melt(samples, id.vars=NULL)
long_samples %>% sample_n(10) # * TO PREVIEW THE DATA
## variable value
## 1 sample_2 3.1681375
## 2 sample_2 3.0722088
## 3 sample_1 -0.2095801
## 4 sample_1 -0.6334911
## 5 sample_2 3.0305055
## 6 sample_1 -1.4391268
## 7 sample_2 3.1120927
## 8 sample_2 3.7130240
## 9 sample_1 -1.1343517
## 10 sample_1 -0.8769928
Providing visuals.
ggplot(long_samples, aes(x=variable, y=value, fill=variable)) +
geom_boxplot(lwd=0.4)
ggplot(long_samples, aes(x=value, fill=variable)) +
geom_histogram(binwidth=0.1, col="black", lwd=0.2)
Looking at sample statistics.
long_samples %>% group_by(variable) %>% summarize(x_bar=mean(value), s=sd(value))
## # A tibble: 2 x 3
## variable x_bar s
## <fct> <dbl> <dbl>
## 1 sample_1 -0.676 0.456
## 2 sample_2 3.12 0.496
Calculating our observed difference in means.
means <- long_samples %>% group_by(variable) %>% summarize(mean(value)) %>% pull()
obs_diff <- means[2]-means[1]
obs_diff
## [1] 3.791752
Creating relabels.
relabels <- generate_labels(long_samples %>% pull(variable), 1000)
head(relabels[,1:6])
## relabel_1 relabel_2 relabel_3 relabel_4 relabel_5 relabel_6
## 1 sample_2 sample_1 sample_1 sample_2 sample_2 sample_2
## 2 sample_2 sample_2 sample_2 sample_1 sample_2 sample_1
## 3 sample_2 sample_1 sample_2 sample_1 sample_2 sample_1
## 4 sample_1 sample_1 sample_1 sample_2 sample_1 sample_1
## 5 sample_1 sample_2 sample_2 sample_2 sample_1 sample_1
## 6 sample_1 sample_1 sample_1 sample_1 sample_2 sample_2
Calculating differences in sample means.
calculated_diffs <- calculate_sample_mean_diffs(long_samples %>% pull(value), relabels)
Plotting the simulated null distribution.
ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) +
geom_histogram(binwidth=0.2, fill="cornflowerblue", col="black", lwd=0.2) +
geom_vline(xintercept=obs_diff, lwd=1.5)
Calculating our p-value.
calculate_pvalue(obs_diff, calculated_diffs)
## [1] 0
Based on this p-value, we would reject the null hypothesis. This is how it should be, as we simulated the data from non-overlapping distributions. (A p-value of exactly 0 here just means that none of the 1,000 permuted differences was as extreme as the observed one.)
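As a side note, a common convention (my addition here, not part of the analysis above) avoids reporting exactly 0 by counting the observed statistic as one of the permutations, so the estimate is bounded away from 0.
# "Add one" adjustment: include the observed difference among the permutations
# so the estimated p-value can never be exactly 0 (here about 0.002 with B=1000)
2*(1 + sum(calculated_diffs >= obs_diff))/(1 + length(calculated_diffs))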
We are going to see how competing tests perform on the data from part 2 (the Spotify data) and from part 3 (the simulated, non-overlapping samples).
Under certain assumptions, we can use a two sample t-test: we assume independence between the groups and that each sample comes from an approximately normal distribution (or that the sample sizes are large enough for the sample means to be approximately normal).
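One informal way to probe the normality assumption on samples this small (my addition, not part of the original comparison) is a Shapiro-Wilk test on each group.
# Shapiro-Wilk normality checks; small p-values would suggest non-normality
shapiro.test(spotify_subset %>% filter(tempo=="above") %>% pull(danceability))
shapiro.test(spotify_subset %>% filter(tempo=="below") %>% pull(danceability))
With that sanity check in mind, here are the t-tests themselves.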
t.test(x=spotify_subset %>% filter(tempo=="above") %>% pull(danceability),
y=spotify_subset %>% filter(tempo=="below") %>% pull(danceability),
paired=FALSE)
##
## Welch Two Sample t-test
##
## data: spotify_subset %>% filter(tempo == "above") %>% pull(danceability) and spotify_subset %>% filter(tempo == "below") %>% pull(danceability)
## t = -0.069968, df = 25.529, p-value = 0.9448
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1192801 0.1114340
## sample estimates:
## mean of x mean of y
## 0.6150769 0.6190000
t.test(x=sample_1, y=sample_2, paired=FALSE)
##
## Welch Two Sample t-test
##
## data: sample_1 and sample_2
## t = -17.796, df = 17.875, p-value = 8.124e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.239612 -3.343893
## sample estimates:
## mean of x mean of y
## -0.6762395 3.1155126
We have also studied the Mann-Whitney (Wilcoxon Rank Sum) test to test these same hypotheses. Let’s run the test on our data to compare.
wilcox.test(x=spotify_subset %>% filter(tempo=="above") %>% pull(danceability),
y=spotify_subset %>% filter(tempo=="below") %>% pull(danceability),
paired=FALSE)
##
## Wilcoxon rank sum test
##
## data: spotify_subset %>% filter(tempo == "above") %>% pull(danceability) and spotify_subset %>% filter(tempo == "below") %>% pull(danceability)
## W = 86, p-value = 0.6177
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x=sample_1, y=sample_2, paired=FALSE)
##
## Wilcoxon rank sum test
##
## data: sample_1 and sample_2
## W = 0, p-value = 1.083e-05
## alternative hypothesis: true location shift is not equal to 0
For these two data scenarios (Spotify and simulated), all three tests led to the same decisions. The two sample t-test gave the most extreme p-values (setting aside the permutation test, whose precision is limited by the number of permutations): it clearly failed to reject for the Spotify data and clearly rejected for the simulated data.
|  | Permutation test | Two sample t-test | Wilcoxon Rank Sum test |
|---|---|---|---|
| Spotify | 0.882 | 0.9448 | 0.6177 |
| Theoretical | 0 (rounded) | 8.124e-13 | 1.083e-05 |