This test is called the “Wilcoxon Rank-Sum Test” in lecture, but I prefer Mann-Whitney because using the name makes it more distinguishable (at least for me) than the “Wilcoxon Rank-Sign Test”.

The Mann-Whitney test is the nonparametric cousin of the two sample t-test. That means in order to use the Mann-Whitney test, we don’t care for any distributional assumptions on our data. We still care, of course, about the randomness of our dataset!

They hypotheses we are interested in testing are \[ H_0: \text{The distributions of the two groups are the same.} \] versus

\[ H_1: \text{The distributions of the two groups are NOT the same.} \]

Deriving our question

We have the following dataset on Modcloth clothing fit data. While we can ask many interesting questions about this dataset, we need to ask a question about means of two populations.

library(readr)
library(dplyr)
modcloth_sample <- read_csv("data/modcloth_sample.csv")[,-1] %>% select(-user_name)
head(modcloth_sample, 10)

## # A tibble: 10 x 16
##    waist  size quality `cup size`  hips `bra size` category  bust height
##    <dbl> <dbl>   <dbl> <chr>      <dbl>      <dbl> <chr>    <dbl> <chr> 
##  1    NA    14       3 c             NA         36 bottoms     NA 5ft 3…
##  2    NA     4       3 b             36         34 tops        NA 5ft 3…
##  3    NA     4       5 c             36         34 tops        NA 5ft 1…
##  4    NA     8       5 ddd/f         NA         38 wedding     NA 5ft 5…
##  5    NA     4       5 b             32         32 tops        NA 5ft 2…
##  6    NA     4       5 a             35         32 bottoms     34 5ft 3…
##  7    NA     1       5 a             30         36 bottoms     20 3ft 2…
##  8    NA     8       5 b             30         34 outerwe…    NA 5ft 5…
##  9    NA    12       4 b             45         38 tops        NA 5ft 2…
## 10    NA     8       4 c             34         36 new         NA 5ft 4…
## # … with 7 more variables: length <chr>, fit <chr>, user_id <dbl>, `shoe
## #   size` <dbl>, `shoe width` <chr>, review_summary <chr>,
## #   review_text <chr>

Here are the counts of clothing fits. We see that most of the clothes sent out fit (wow, that’s nice!) and the next largest group was that the clothing fit was large. However, the group who has a small fit had a count not too far from the large fit.

table(modcloth_sample$fit)

## 
##   fit large small 
##   357    78    65

This leads me to try and understand the clothes that do not fit.

not_fit <- modcloth_sample %>% filter(fit != "fit")
head(not_fit, 10)

## # A tibble: 10 x 16
##    waist  size quality `cup size`  hips `bra size` category  bust height
##    <dbl> <dbl>   <dbl> <chr>      <dbl>      <dbl> <chr>    <dbl> <chr> 
##  1    NA    14       3 c             NA         36 bottoms     NA 5ft 3…
##  2    NA     4       3 b             36         34 tops        NA 5ft 3…
##  3    NA     8       5 ddd/f         NA         38 wedding     NA 5ft 5…
##  4    NA     1       5 a             30         36 bottoms     20 3ft 2…
##  5    NA     8       4 c             34         36 new         NA 5ft 4…
##  6    40    32       4 d             48         40 sale        NA 5ft 9…
##  7    38    20       1 b             44         38 sale        NA 5ft 8…
##  8    27     4       2 b             NA         36 bottoms     NA 5ft 7…
##  9    NA    20       5 b             35         38 tops        NA 5ft 9…
## 10    NA     8       4 a             38         34 tops        NA 5ft 5…
## # … with 7 more variables: length <chr>, fit <chr>, user_id <dbl>, `shoe
## #   size` <dbl>, `shoe width` <chr>, review_summary <chr>,
## #   review_text <chr>

While again there are many interesting questions that we may ask of this dataset, let’s study these hypotheses.

\[ H_0: \text{The distribution of hip size for "small" clothing fit and the distribution of "large" clothing fit are not the same.} \]

versus

\[ H_1: \text{The distribution of hip size for "small" clothing fit and the distribution of "large" clothing fit are not the same.} \]

Visualizing our data

library(ggplot2)
not_fit <- not_fit %>% filter(!is.na(hips)) %>% filter(!is.na(fit))
ggplot(not_fit, aes(x=hips, fill=fit)) + 
  geom_histogram(col="black", lwd=0.2, binwidth=1.5) + 
  ggtitle("Histogram of hip size")

not_fit <- not_fit %>% filter(!is.na(hips)) %>% filter(!is.na(fit))
ggplot(not_fit, aes(y=hips, x=fit, fill=fit)) + 
  geom_boxplot(lwd=0.5) + 
  ggtitle("Histogram of hip size")

1. The long, step-by-step way

Arranging our data

We first want to use arrange on the hips column. I also used select to grab only the columns we are interested in right now.

not_fit <- not_fit %>% select(fit, hips) %>% arrange(hips)
head(not_fit, 10)

## # A tibble: 10 x 2
##    fit    hips
##    <chr> <dbl>
##  1 large    30
##  2 small    30
##  3 small    32
##  4 large    32
##  5 small    33
##  6 small    33
##  7 large    33
##  8 small    33
##  9 large    33
## 10 small    33

Assign ranks using function

Don’t worry about how to create a function. Just understand the process of what it is doing. I commented below. The process is how you would do it by hand.

assign_wrs <- function(obs) {
  ranks <- c()            # * CREATE EMPTY VECTOR TO SAVE RANKS
  this_rank <- 1          # * START WITH RANK 1
  
  # * FOR EACH OF THE UNIQUE VALUES OF OBSERVATIONS
  for(ix in 1:length(unique(obs))) {
    
    # * CHECK IF THERE ARE DUPLICATED OBSERVATIONS
    these_ix <- which(obs == unique(obs)[ix])
    
      # * IF THERE ARE NO DUPLICATES, ASSIGN STANDARD RANK
      if (length(these_ix)==1) {
        ranks <- c(ranks, this_rank)
        this_rank <- length(ranks) + 1
      
      # * IF THERE ARE DUPLICATES, ASSIGN AN AVERAGED RANK
      } else {
          these_ranks <- rep(mean(c(this_rank, this_rank+length(these_ix)-1)), length(these_ix))
          ranks <- c(ranks, these_ranks)
          this_rank <- length(ranks) + 1
      }
  }
  
  # * OUTPUT RANKS
  ranks
}

I will use mutate to apply the function I wrote above to assign ranks to each of the hip sizes.

# * APPLY FUNCTION USING MUTATE
not_fit <- not_fit %>% mutate(rank=assign_wrs(hips))
head(not_fit, 10)

## # A tibble: 10 x 3
##    fit    hips  rank
##    <chr> <dbl> <dbl>
##  1 large    30   1.5
##  2 small    30   1.5
##  3 small    32   3.5
##  4 large    32   3.5
##  5 small    33   7.5
##  6 small    33   7.5
##  7 large    33   7.5
##  8 small    33   7.5
##  9 large    33   7.5
## 10 small    33   7.5

Sum up the ranks for large

not_fit %>% filter(fit=="large") %>% summarize(sum(rank))

## # A tibble: 1 x 1
##   `sum(rank)`
##         <dbl>
## 1       3262.

Sum up the ranks for small

not_fit %>% filter(fit=="small") %>% summarize(sum(rank))

## # A tibble: 1 x 1
##   `sum(rank)`
##         <dbl>
## 1       1688.

Compare the sums in a standardized way

Below is our test statistic \(Z_W\). \[ Z_W = \frac{W-\mu_W}{\sigma_W} \]

where the mean is

\[ \mu_W = \frac{n_S(n_S+n_L+1)}{2} \] and the standard deviation is

\[ \sigma_W = \sqrt{\frac{n_Sn_L(n_S+n_L+1)}{12}} \]

and we have \(W=min\{3261.5, 1688.5\}\). We can calculate this in R.

not_fit %>% group_by(fit) %>% summarize(n=n())

## # A tibble: 2 x 2
##   fit       n
##   <chr> <int>
## 1 large    62
## 2 small    37

n_l <- 62
n_s <- 37
not_fit %>% summarize(mu_w=n_s*(n_s+n_l+1)/2,
                      sigma_w=sqrt(n_s*n_l*(n_s+n_l+1)/12),
                      z_w=(1688.5-mu_w)/sigma_w)

## # A tibble: 1 x 3
##    mu_w sigma_w   z_w
##   <dbl>   <dbl> <dbl>
## 1  1850    138. -1.17

2*pnorm(-1.168063, lower.tail=TRUE)

## [1] 0.2427814

Interpretations and conclusion

Given that \(\mu_L=\mu_S\), we expect to see data like ours or more extreme (more deviant from the mean) 24.28% of the time.

We fail to reject \(H_0\), the notion that the large fit hip size is the same as the small fit hip size.

2. The `R` shortcut

We can verify our result using the wilcox.test function in R with these parameters.

wilcox.test(x=not_fit %>% filter(fit=="large") %>% pull(hips),
            y=not_fit %>% filter(fit=="small") %>% pull(hips),
            paired=FALSE, correct=TRUE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  not_fit %>% filter(fit == "large") %>% pull(hips) and not_fit %>% filter(fit == "small") %>% pull(hips)
## W = 1308.5, p-value = 0.2431
## alternative hypothesis: true location shift is not equal to 0

3. Comparison to the two sample t-test

Because we have data from similarly shaped distributions that don’t have many outliers and large enough \(n_L\) and \(n_S\), we should be interested in using a two sample t-test. Recall the assumptions of a two sample t-test.

t.test(x=not_fit %>% filter(fit=="large") %>% pull(hips),
       y=not_fit %>% filter(fit=="small") %>% pull(hips),
       paired=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  not_fit %>% filter(fit == "large") %>% pull(hips) and not_fit %>% filter(fit == "small") %>% pull(hips)
## t = 1.2349, df = 77.883, p-value = 0.2206
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.054818  4.501200
## sample estimates:
## mean of x mean of y 
##  42.20968  40.48649

Our p-value here (0.2206) is close to what we calculated earlier (0.2431). Both of the p-values would lead to the same conclusion.

Mann-Whitney U Test