The permutation test is based on simulating a null distribution. From that null distribution, we calculate a p-value.
Say the hypotheses we wish to test are \[ H_0: \mu_1=\mu_2 \text{ versus } H_1: \mu_1 \neq \mu_2 \]
for some populations 1 and 2.
Your null distribution is governed by your null hypothesis. Our null says, “Mean 1 equals mean 2”. So, we say, “Yes, ma’am, Ms. Null. Mean 1 will equal mean 2 in our simulation.”
For a given dataset, if the means of the two distributions were the same, then any observation we have could have come from either population 1 or population 2. Imagine you have two datasets pulled from \(N(0, 0.5)\) distributions (mean 0, standard deviation 0.5), i.e. we know that \(\mu_1=\mu_2\).
We are going to walk through the steps of a permutation test, expecting to fail to reject the null hypothesis.
set.seed(142)
sample_1 <- rnorm(n=10, mean=0, sd=0.5)
sample_2 <- rnorm(n=10, mean=0, sd=0.5)
samples <- data.frame(cbind(sample_1, sample_2))
samples
## sample_1 sample_2
## 1 0.62736043 0.73789956
## 2 0.85204271 -0.81809309
## 3 0.44016151 0.64780971
## 4 -0.29077247 -0.06890457
## 5 -0.08459428 0.24827375
## 6 0.53238943 0.17254728
## 7 0.35918774 0.21900323
## 8 0.01912117 0.06949258
## 9 -0.52483197 -0.68411150
## 10 0.10615675 0.29494909
Let’s melt the dataframe into long format: instead of two columns, one per sample, we will have one column denoting which sample each value came from and a second column holding the value itself.
library(reshape2)
long_samples <- melt(samples, id.vars=NULL)
library(dplyr)
long_samples %>% sample_n(10)
## variable value
## 1 sample_1 0.85204271
## 2 sample_1 -0.08459428
## 3 sample_1 0.44016151
## 4 sample_1 0.53238943
## 5 sample_2 0.21900323
## 6 sample_1 0.10615675
## 7 sample_2 0.64780971
## 8 sample_2 0.06949258
## 9 sample_1 0.01912117
## 10 sample_2 -0.81809309
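(As an aside, reshape2 has been superseded these days; if you prefer the tidyverse, tidyr::pivot_longer produces the same long format. A minimal sketch, with `long_samples_alt` as an illustrative name; note the `variable` column comes back as character rather than factor.)
library(tidyr)
# Equivalent reshape with tidyr; names_to/values_to chosen to match melt's output
long_samples_alt <- samples %>%
  pivot_longer(everything(), names_to="variable", values_to="value")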
Let’s visualize this now.
library(ggplot2)
ggplot(long_samples, aes(x=variable, y=value, fill=variable)) +
geom_boxplot(lwd=0.4)
ggplot(long_samples, aes(x=value, fill=variable)) +
geom_histogram(binwidth=0.1, col="black", lwd=0.2)
ggplot(long_samples, aes(x=value, fill=variable)) +
geom_histogram(binwidth=0.1, col="black", lwd=0.2) +
facet_wrap(~variable)
Both samples are centered close to 0, but sample 2 has a larger variance. This is not due to the underlying distributions; both were simulated from \(N(0,0.5)\). It is due to sampling variability (the sample size is only 10 for each). From our dataset, we see these statistics.
long_samples %>% group_by(variable) %>% summarize(x_bar=mean(value), s=sd(value))
## # A tibble: 2 x 3
## variable x_bar s
## <fct> <dbl> <dbl>
## 1 sample_1 0.204 0.433
## 2 sample_2 0.0819 0.502
Typically, in order to test something like the above, we’d use a two sample t-test to compare the two means.
Recall, for comparison as we continue, that for a t-test we would have:
\[ T= \frac{\bar{x}_1-\bar{x}_2}{\widehat{SE}(\bar{x}_1-\bar{x}_2)} \]
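To make that concrete, here is a minimal sketch computing \(T\) by hand for our samples, using the Welch (unpooled) standard error (the names `x1`, `x2`, `se_hat`, and `T_stat` are mine, introduced only for illustration).
# Welch (unpooled) standard error of the difference in sample means
x1 <- long_samples %>% filter(variable=="sample_1") %>% pull(value)
x2 <- long_samples %>% filter(variable=="sample_2") %>% pull(value)
se_hat <- sqrt(var(x1)/length(x1) + var(x2)/length(x2))
# The statistic: difference in sample means over its estimated standard error
T_stat <- (mean(x1)-mean(x2))/se_hat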
We’ll take this time now to calculate our observed mean difference.
means <- long_samples %>% group_by(variable) %>% summarize(mean(value)) %>% pull()
obs_diff <- means[2]-means[1]
obs_diff
## [1] -0.1217355
Because we have a small sample size, our \(\bar{x}\) is expected to be imprecise for both of our samples. (And they are; 0.204 and 0.0819 are not exactly the \(\mu=0\) we set to be true when simulating the data.)
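To see just how imprecise, here is a quick sketch (my addition) of the estimated standard errors of the sample means, \(s/\sqrt{n}\).
# Standard errors of the sample means; with n=10 these are sizable relative
# to the observed means, so estimates this far from 0 are unsurprising
long_samples %>% group_by(variable) %>% summarize(se_x_bar=sd(value)/sqrt(n()))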
If it is true that \(\mu_1=\mu_2\), then we can shuffle our labels on our data. I am going to use this function to generate new labels.
# * FUNCTION TO CREATE LABELS
generate_labels <- function(labels, repetitions) {
  # * EACH COLUMN IS ONE RANDOM SHUFFLE (PERMUTATION) OF THE ORIGINAL LABELS
  relabels <- do.call(cbind,
                      lapply(1:repetitions, function(x)
                        sample(as.character(labels),
                               length(labels),
                               replace=FALSE)))
  relabels <- data.frame(data=relabels)
  names(relabels) <- paste0("relabel_", 1:repetitions)
  relabels
}
# * APPLY FUNCTION
relabels <- generate_labels(long_samples %>% pull(variable), 1000)
head(relabels[,1:6])
## relabel_1 relabel_2 relabel_3 relabel_4 relabel_5 relabel_6
## 1 sample_1 sample_1 sample_2 sample_2 sample_2 sample_1
## 2 sample_2 sample_2 sample_1 sample_2 sample_1 sample_1
## 3 sample_1 sample_2 sample_1 sample_2 sample_1 sample_1
## 4 sample_1 sample_1 sample_2 sample_1 sample_1 sample_2
## 5 sample_2 sample_2 sample_2 sample_1 sample_1 sample_2
## 6 sample_1 sample_1 sample_1 sample_2 sample_2 sample_1
How does this help? Well, now we can build a null distribution. For each relabeling, we take the two sample means and subtract one from the other because…
\[ H_0: \mu_1=\mu_2 \implies H_0: \mu_2-\mu_1=0 \]
Remember, if our null were true, then we would see the distribution of these differences center around 0. We’ll use the function below.
# * FUNCTION TO CALCULATE SAMPLE MEAN DIFFERENCES
calculate_sample_mean_diffs <- function(value_col, relabel_df) {
  # * CREATE A "NOTE TAKER"
  calculated_diffs <- c()
  # * FOR EACH OF THE RELABELS
  for (ix in 1:ncol(relabel_df)) {
    # * CALCULATE THE DIFFERENCE IN MEANS
    # * (data.frame, NOT cbind, SO THE NUMERIC VALUES ARE NOT COERCED TO TEXT)
    tmp_df <- data.frame(value=value_col, relabel=relabel_df[,ix])
    means <- tmp_df %>% group_by(relabel) %>% summarize(mean(value)) %>% pull()
    calculated_diffs <- c(calculated_diffs, means[2]-means[1])
  }
  calculated_diffs
}
# * APPLYING THE FUNCTION
calculated_diffs <- calculate_sample_mean_diffs(long_samples %>% pull(value), relabels)
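As an aside, the same differences can be computed more compactly without an explicit loop; a sketch under the same inputs (`calculated_diffs_alt` is an illustrative name).
# For each relabel column, tapply computes the two group means (in sorted label
# order) and diff takes means[2]-means[1], matching the loop above
calculated_diffs_alt <- vapply(relabels, function(lab)
  diff(tapply(long_samples$value, lab, mean)), numeric(1))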
So here’s our simulated null distribution. Normally, we would calculate some test statistic and compare it to a known distribution (normal, Student’s t, F, or chi-squared) to derive a p-value. This time, we’re going to use this simulated distribution to get a p-value.
ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) +
geom_histogram(binwidth=0.05, fill="cornflowerblue", col="black", lwd=0.2) +
geom_vline(xintercept=obs_diff, lwd=1.5)
Our black line is our observed difference. Since our alternative is two-sided and our observed difference is negative, we will take 2 times the fraction of the simulated differences that lie at or below it to be our p-value (had it been positive, we would use the fraction at or above). We’ll use a function to formalize this as well.
# * FUNCTION TO CALCULATE P-VALUE
calculate_pvalue <- function(observed, simulated) {
  if (observed < 0) {
    # * TWICE THE FRACTION OF SIMULATED DIFFERENCES AT OR BELOW THE OBSERVED
    p_value <- 2*(length(which(simulated<=observed))/length(simulated))
  } else {
    # * TWICE THE FRACTION OF SIMULATED DIFFERENCES AT OR ABOVE THE OBSERVED
    p_value <- 2*(length(which(simulated>=observed))/length(simulated))
  }
  p_value
}
# * APPLYING THE FUNCTION
calculate_pvalue(obs_diff, calculated_diffs)
## [1] 0.538
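As a side note, because the simulated null distribution is roughly symmetric about 0 here, an essentially equivalent one-liner (my addition) counts both tails at once, so no factor of 2 is needed.
# Fraction of permuted differences at least as extreme (in absolute value)
# as the observed difference; both tails are counted, so no doubling
mean(abs(calculated_diffs) >= abs(obs_diff))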
With a p-value like the one above, we fail to reject \(H_0\). We’ve essentially shown that when the two distributions really are the same, the permutation test behaves the way it should.
Now, how would this work on real data? One thing will definitely change for us in this setting.
We do not in fact know that the means of the two groups are equal to each other. We only knew that earlier because we made them equal through simulation.
Say that I was interested in testing
\[ H_0: \mu_{\text{(Danceability|Above mean tempo)}} = \mu_{\text{(Danceability|At or below mean tempo)}} \] versus
\[ H_1: \mu_{\text{(Danceability|Above mean tempo)}} \neq \mu_{\text{(Danceability|At or below mean tempo)}} \]
for the following dataset
spotify_subset <- spotify_subset %>% mutate(tempo=ifelse(tempo>mean(tempo), "above", "below"))
head(spotify_subset)
## # A tibble: 6 x 4
## artist song_title danceability tempo
## <chr> <chr> <dbl> <chr>
## 1 Tori Kelly First Heartbreak 0.657 above
## 2 J. Cole No Role Modelz 0.696 below
## 3 Queen Don't Stop Me Now - Remastered 2011 0.559 above
## 4 Fall Out Boy Centuries 0.385 above
## 5 Martin Garrix Poison 0.502 above
## 6 Backstreet Boys I Want It That Way 0.683 below
Let’s see how many fall above and below the mean tempo.
spotify_subset %>% group_by(tempo) %>% summarize(n())
## # A tibble: 2 x 2
## tempo `n()`
## <chr> <int>
## 1 above 13
## 2 below 15
Let’s take a look at how danceability behaves for tempos above and below (or equal to) the mean tempo.
ggplot(spotify_subset, aes(y=danceability, x=tempo, fill=tempo)) +
geom_boxplot(lwd=0.4)
ggplot(spotify_subset, aes(x=danceability, fill=tempo)) +
geom_histogram(binwidth=0.03, col="black", lwd=0.4)
ggplot(spotify_subset, aes(x=danceability, fill=tempo)) +
geom_histogram(binwidth=0.03, col="black", lwd=0.4) +
facet_wrap(~tempo)
What is our observed difference in means?
means <- spotify_subset %>% group_by(tempo) %>% summarize(mean(danceability)) %>% pull()
obs_diff <- means[2]-means[1]
obs_diff
## [1] 0.003923077
It’s pretty small! And like earlier, under the assumption that the means are the same, we can reshuffle the labels. So, let’s use our function to generate labels.
spotify_relabels <- generate_labels(spotify_subset %>% pull(tempo), 1000)
head(spotify_relabels[,1:6])
## relabel_1 relabel_2 relabel_3 relabel_4 relabel_5 relabel_6
## 1 below below below below above below
## 2 above below below above below below
## 3 above above below below below above
## 4 above above above below below above
## 5 above above above above above below
## 6 above below above below below above
And then calculate the difference in sample means given all of these relabels. We will show these in a histogram with a vertical line representing our observed difference in means.
calculated_diffs <- calculate_sample_mean_diffs(spotify_subset %>% pull(danceability), spotify_relabels)
ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) +
geom_histogram(binwidth=0.01, fill="cornflowerblue", col="black", lwd=0.2) +
geom_vline(xintercept=obs_diff, lwd=1.5)
And here, we’ll calculate our p-value.
calculate_pvalue(obs_diff, calculated_diffs)
## [1] 0.882
For this example as well, we fail to reject \(H_0\); we have no evidence that mean danceability differs between the two tempo groups.
Since we have now failed to reject twice, I’ll show an example where we will reject \(H_0\), so you can see that case at least once. I will not comment as much below; we are testing the same hypotheses and using the same functions.
set.seed(142)
sample_1 <- runif(n=10, min=-2, max=0)
sample_2 <- runif(n=10, min=2, max=4)
samples <- data.frame(cbind(sample_1, sample_2))
samples
## sample_1 sample_2
## 1 -0.20958012 3.713024
## 2 -0.60193654 3.762831
## 3 -0.08836513 3.527474
## 4 -0.87699277 3.121733
## 5 -0.37868435 3.030506
## 6 -0.26440232 3.112093
## 7 -1.43912677 2.293873
## 8 -0.63349114 3.072209
## 9 -1.13435167 3.168138
## 10 -1.13546463 2.353247
Putting data in long format for ease.
long_samples <- melt(samples, id.vars=NULL)
long_samples %>% sample_n(10) # * TO PREVIEW THE DATA
## variable value
## 1 sample_2 3.1681375
## 2 sample_2 3.0722088
## 3 sample_1 -0.2095801
## 4 sample_1 -0.6334911
## 5 sample_2 3.0305055
## 6 sample_1 -1.4391268
## 7 sample_2 3.1120927
## 8 sample_2 3.7130240
## 9 sample_1 -1.1343517
## 10 sample_1 -0.8769928
Providing visuals.
ggplot(long_samples, aes(x=variable, y=value, fill=variable)) +
geom_boxplot(lwd=0.4)
ggplot(long_samples, aes(x=value, fill=variable)) +
geom_histogram(binwidth=0.1, col="black", lwd=0.2)
Looking at sample statistics.
long_samples %>% group_by(variable) %>% summarize(x_bar=mean(value), s=sd(value))
## # A tibble: 2 x 3
## variable x_bar s
## <fct> <dbl> <dbl>
## 1 sample_1 -0.676 0.456
## 2 sample_2 3.12 0.496
Calculating our observed difference in means.
means <- long_samples %>% group_by(variable) %>% summarize(mean(value)) %>% pull()
obs_diff <- means[2]-means[1]
obs_diff
## [1] 3.791752
Creating relabels.
relabels <- generate_labels(long_samples %>% pull(variable), 1000)
head(relabels[,1:6])
## relabel_1 relabel_2 relabel_3 relabel_4 relabel_5 relabel_6
## 1 sample_2 sample_1 sample_1 sample_2 sample_2 sample_2
## 2 sample_2 sample_2 sample_2 sample_1 sample_2 sample_1
## 3 sample_2 sample_1 sample_2 sample_1 sample_2 sample_1
## 4 sample_1 sample_1 sample_1 sample_2 sample_1 sample_1
## 5 sample_1 sample_2 sample_2 sample_2 sample_1 sample_1
## 6 sample_1 sample_1 sample_1 sample_1 sample_2 sample_2
Calculating differences in sample means.
calculated_diffs <- calculate_sample_mean_diffs(long_samples %>% pull(value), relabels)
Plotting the simulated null distribution.
ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) +
geom_histogram(binwidth=0.2, fill="cornflowerblue", col="black", lwd=0.2) +
geom_vline(xintercept=obs_diff, lwd=1.5)
Calculating our p-value.
calculate_pvalue(obs_diff, calculated_diffs)
## [1] 0
Based on this p-value, we would reject the null hypothesis. This is how it should be, as we simulated the data from non-overlapping distributions. (A p-value of exactly 0 here just means that none of the 1,000 permuted differences was as extreme as the observed one.)
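As a side note, a common convention (my addition here, not part of the analysis above) avoids reporting exactly 0 by counting the observed statistic as one of the permutations, so the estimate is bounded away from 0.
# "Add one" adjustment: include the observed difference among the permutations
# so the estimated p-value can never be exactly 0 (here about 0.002 with B=1000)
2*(1 + sum(calculated_diffs >= obs_diff))/(1 + length(calculated_diffs))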
We are going to see how competing tests perform on the data from part 2 (the Spotify data) and from part 3 (the simulated, non-overlapping samples).
Under certain assumptions, we can use a two sample t-test: we assume independence between the groups and that each sample comes from an approximately normal distribution (or that the sample sizes are large enough for the sample means to be approximately normal).
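One informal way to probe the normality assumption on samples this small (my addition, not part of the original comparison) is a Shapiro-Wilk test on each group.
# Shapiro-Wilk normality checks; small p-values would suggest non-normality
shapiro.test(spotify_subset %>% filter(tempo=="above") %>% pull(danceability))
shapiro.test(spotify_subset %>% filter(tempo=="below") %>% pull(danceability))
With that sanity check in mind, here are the t-tests themselves.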
t.test(x=spotify_subset %>% filter(tempo=="above") %>% pull(danceability),
y=spotify_subset %>% filter(tempo=="below") %>% pull(danceability),
paired=FALSE)
##
## Welch Two Sample t-test
##
## data: spotify_subset %>% filter(tempo == "above") %>% pull(danceability) and spotify_subset %>% filter(tempo == "below") %>% pull(danceability)
## t = -0.069968, df = 25.529, p-value = 0.9448
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1192801 0.1114340
## sample estimates:
## mean of x mean of y
## 0.6150769 0.6190000
t.test(x=sample_1, y=sample_2, paired=FALSE)
##
## Welch Two Sample t-test
##
## data: sample_1 and sample_2
## t = -17.796, df = 17.875, p-value = 8.124e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.239612 -3.343893
## sample estimates:
## mean of x mean of y
## -0.6762395 3.1155126
We have also studied the Mann-Whitney (Wilcoxon Rank Sum) test to test these same hypotheses. Let’s run the test on our data to compare.
wilcox.test(x=spotify_subset %>% filter(tempo=="above") %>% pull(danceability),
y=spotify_subset %>% filter(tempo=="below") %>% pull(danceability),
paired=FALSE)
##
## Wilcoxon rank sum test
##
## data: spotify_subset %>% filter(tempo == "above") %>% pull(danceability) and spotify_subset %>% filter(tempo == "below") %>% pull(danceability)
## W = 86, p-value = 0.6177
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x=sample_1, y=sample_2, paired=FALSE)
##
## Wilcoxon rank sum test
##
## data: sample_1 and sample_2
## W = 0, p-value = 1.083e-05
## alternative hypothesis: true location shift is not equal to 0
For these two data scenarios (Spotify and simulated), all three tests led to the same decisions. The two sample t-test gave the most extreme p-values (setting aside the permutation test, whose precision is limited by the number of permutations): it clearly failed to reject for the Spotify data and clearly rejected for the simulated data.
|  | Permutation test | Two sample t-test | Wilcoxon Rank Sum test |
|---|---|---|---|
| Spotify | 0.882 | 0.9448 | 0.6177 |
| Theoretical | 0 (rounded) | 8.124e-13 | 1.083e-05 |