This test is called the “Wilcoxon Rank-Sum Test” in lecture, but I prefer Mann-Whitney because using the name makes it more distinguishable (at least for me) than the “Wilcoxon Rank-Sign Test”.
The Mann-Whitney test is the nonparametric cousin of the two sample t-test. That means in order to use the Mann-Whitney test, we don’t care for any distributional assumptions on our data. We still care, of course, about the randomness of our dataset!
They hypotheses we are interested in testing are \[ H_0: \text{The distributions of the two groups are the same.} \] versus
\[ H_1: \text{The distributions of the two groups are NOT the same.} \]
We have the following dataset on Modcloth clothing fit data. While we can ask many interesting questions about this dataset, we need to ask a question about means of two populations.
library(readr)
library(dplyr)
modcloth_sample <- read_csv("data/modcloth_sample.csv")[,-1] %>% select(-user_name)
head(modcloth_sample, 10)
## # A tibble: 10 x 16
## waist size quality `cup size` hips `bra size` category bust height
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 NA 14 3 c NA 36 bottoms NA 5ft 3…
## 2 NA 4 3 b 36 34 tops NA 5ft 3…
## 3 NA 4 5 c 36 34 tops NA 5ft 1…
## 4 NA 8 5 ddd/f NA 38 wedding NA 5ft 5…
## 5 NA 4 5 b 32 32 tops NA 5ft 2…
## 6 NA 4 5 a 35 32 bottoms 34 5ft 3…
## 7 NA 1 5 a 30 36 bottoms 20 3ft 2…
## 8 NA 8 5 b 30 34 outerwe… NA 5ft 5…
## 9 NA 12 4 b 45 38 tops NA 5ft 2…
## 10 NA 8 4 c 34 36 new NA 5ft 4…
## # … with 7 more variables: length <chr>, fit <chr>, user_id <dbl>, `shoe
## # size` <dbl>, `shoe width` <chr>, review_summary <chr>,
## # review_text <chr>
Here are the counts of clothing fits. We see that most of the clothes sent out fit (wow, that’s nice!) and the next largest group was that the clothing fit was large. However, the group who has a small fit had a count not too far from the large fit.
table(modcloth_sample$fit)
##
## fit large small
## 357 78 65
This leads me to try and understand the clothes that do not fit.
not_fit <- modcloth_sample %>% filter(fit != "fit")
head(not_fit, 10)
## # A tibble: 10 x 16
## waist size quality `cup size` hips `bra size` category bust height
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 NA 14 3 c NA 36 bottoms NA 5ft 3…
## 2 NA 4 3 b 36 34 tops NA 5ft 3…
## 3 NA 8 5 ddd/f NA 38 wedding NA 5ft 5…
## 4 NA 1 5 a 30 36 bottoms 20 3ft 2…
## 5 NA 8 4 c 34 36 new NA 5ft 4…
## 6 40 32 4 d 48 40 sale NA 5ft 9…
## 7 38 20 1 b 44 38 sale NA 5ft 8…
## 8 27 4 2 b NA 36 bottoms NA 5ft 7…
## 9 NA 20 5 b 35 38 tops NA 5ft 9…
## 10 NA 8 4 a 38 34 tops NA 5ft 5…
## # … with 7 more variables: length <chr>, fit <chr>, user_id <dbl>, `shoe
## # size` <dbl>, `shoe width` <chr>, review_summary <chr>,
## # review_text <chr>
While again there are many interesting questions that we may ask of this dataset, let’s study these hypotheses.
\[ H_0: \text{The distribution of hip size for "small" clothing fit and the distribution of "large" clothing fit are not the same.} \]
versus
\[ H_1: \text{The distribution of hip size for "small" clothing fit and the distribution of "large" clothing fit are not the same.} \]
library(ggplot2)
not_fit <- not_fit %>% filter(!is.na(hips)) %>% filter(!is.na(fit))
ggplot(not_fit, aes(x=hips, fill=fit)) +
geom_histogram(col="black", lwd=0.2, binwidth=1.5) +
ggtitle("Histogram of hip size")
not_fit <- not_fit %>% filter(!is.na(hips)) %>% filter(!is.na(fit))
ggplot(not_fit, aes(y=hips, x=fit, fill=fit)) +
geom_boxplot(lwd=0.5) +
ggtitle("Histogram of hip size")
We first want to use arrange
on the hips
column. I also used select
to grab only the columns we are interested in right now.
not_fit <- not_fit %>% select(fit, hips) %>% arrange(hips)
head(not_fit, 10)
## # A tibble: 10 x 2
## fit hips
## <chr> <dbl>
## 1 large 30
## 2 small 30
## 3 small 32
## 4 large 32
## 5 small 33
## 6 small 33
## 7 large 33
## 8 small 33
## 9 large 33
## 10 small 33
Don’t worry about how to create a function. Just understand the process of what it is doing. I commented below. The process is how you would do it by hand.
assign_wrs <- function(obs) {
ranks <- c() # * CREATE EMPTY VECTOR TO SAVE RANKS
this_rank <- 1 # * START WITH RANK 1
# * FOR EACH OF THE UNIQUE VALUES OF OBSERVATIONS
for(ix in 1:length(unique(obs))) {
# * CHECK IF THERE ARE DUPLICATED OBSERVATIONS
these_ix <- which(obs == unique(obs)[ix])
# * IF THERE ARE NO DUPLICATES, ASSIGN STANDARD RANK
if (length(these_ix)==1) {
ranks <- c(ranks, this_rank)
this_rank <- length(ranks) + 1
# * IF THERE ARE DUPLICATES, ASSIGN AN AVERAGED RANK
} else {
these_ranks <- rep(mean(c(this_rank, this_rank+length(these_ix)-1)), length(these_ix))
ranks <- c(ranks, these_ranks)
this_rank <- length(ranks) + 1
}
}
# * OUTPUT RANKS
ranks
}
I will use mutate
to apply the function I wrote above to assign ranks to each of the hip sizes.
# * APPLY FUNCTION USING MUTATE
not_fit <- not_fit %>% mutate(rank=assign_wrs(hips))
head(not_fit, 10)
## # A tibble: 10 x 3
## fit hips rank
## <chr> <dbl> <dbl>
## 1 large 30 1.5
## 2 small 30 1.5
## 3 small 32 3.5
## 4 large 32 3.5
## 5 small 33 7.5
## 6 small 33 7.5
## 7 large 33 7.5
## 8 small 33 7.5
## 9 large 33 7.5
## 10 small 33 7.5
not_fit %>% filter(fit=="large") %>% summarize(sum(rank))
## # A tibble: 1 x 1
## `sum(rank)`
## <dbl>
## 1 3262.
not_fit %>% filter(fit=="small") %>% summarize(sum(rank))
## # A tibble: 1 x 1
## `sum(rank)`
## <dbl>
## 1 1688.
Below is our test statistic \(Z_W\). \[ Z_W = \frac{W-\mu_W}{\sigma_W} \]
where the mean is
\[ \mu_W = \frac{n_S(n_S+n_L+1)}{2} \] and the standard deviation is
\[ \sigma_W = \sqrt{\frac{n_Sn_L(n_S+n_L+1)}{12}} \]
and we have \(W=min\{3261.5, 1688.5\}\). We can calculate this in R
.
not_fit %>% group_by(fit) %>% summarize(n=n())
## # A tibble: 2 x 2
## fit n
## <chr> <int>
## 1 large 62
## 2 small 37
n_l <- 62
n_s <- 37
not_fit %>% summarize(mu_w=n_s*(n_s+n_l+1)/2,
sigma_w=sqrt(n_s*n_l*(n_s+n_l+1)/12),
z_w=(1688.5-mu_w)/sigma_w)
## # A tibble: 1 x 3
## mu_w sigma_w z_w
## <dbl> <dbl> <dbl>
## 1 1850 138. -1.17
2*pnorm(-1.168063, lower.tail=TRUE)
## [1] 0.2427814
Given that \(\mu_L=\mu_S\), we expect to see data like ours or more extreme (more deviant from the mean) 24.28% of the time.
We fail to reject \(H_0\), the notion that the large fit hip size is the same as the small fit hip size.
R
shortcutWe can verify our result using the wilcox.test
function in R
with these parameters.
wilcox.test(x=not_fit %>% filter(fit=="large") %>% pull(hips),
y=not_fit %>% filter(fit=="small") %>% pull(hips),
paired=FALSE, correct=TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: not_fit %>% filter(fit == "large") %>% pull(hips) and not_fit %>% filter(fit == "small") %>% pull(hips)
## W = 1308.5, p-value = 0.2431
## alternative hypothesis: true location shift is not equal to 0
Because we have data from similarly shaped distributions that don’t have many outliers and large enough \(n_L\) and \(n_S\), we should be interested in using a two sample t-test. Recall the assumptions of a two sample t-test.
t.test(x=not_fit %>% filter(fit=="large") %>% pull(hips),
y=not_fit %>% filter(fit=="small") %>% pull(hips),
paired=FALSE)
##
## Welch Two Sample t-test
##
## data: not_fit %>% filter(fit == "large") %>% pull(hips) and not_fit %>% filter(fit == "small") %>% pull(hips)
## t = 1.2349, df = 77.883, p-value = 0.2206
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.054818 4.501200
## sample estimates:
## mean of x mean of y
## 42.20968 40.48649
Our p-value here (0.2206) is close to what we calculated earlier (0.2431). Both of the p-values would lead to the same conclusion.