The chi-squared test is the second to last test you're learning this semester. (YAY!) On that note, we're really winding down now, and this is the last real lab section.
tibble()
and tribble()
.tibble()
and tribble()
.¶library(tibble)
tibble(name = c("Arya", "Daenerys" , "Sansa" , "Brienne"),
origin = c("Winterfell", "Dragonstone", "Winterfell", "Tarth" ),
occupation = c("Swordswoman", "Queen" , "Lady" , "Swordswoman"),
seasons = c(8 , 8 , 8 , 8 ))
tribble( ~Cersei, ~Tyrion,
"Queen", "Hand of the other queen",
"Evil", "Short")
Sometimes, we would like to know if counts of data follow a certain distribution. That is, we would like to ask the question of: did the data I observe come from the distribution I have hypothesized?
The data we are working are about Modcloth's clothing size recommendation system. We are going to look at the counts of how many of their recommendations fit, were too large, or were too small for the customer.
Let's read_csv()
in our data!
library(readr, warn.conflicts=FALSE)
modcloth <- read_csv("modcloth.csv")
Next, I'm using some normal dplyr
functions like filter()
and select()
.
library(dplyr, warn.conflicts=FALSE)
modcloth <- modcloth %>% filter(category %in% c("bottoms", "tops"))
modcloth <- modcloth %>% select(category, fit)
Here's a small preview of our data using sample_n()
.
modcloth %>% sample_n(5)
Now, let's use the table function to see how our counts are distributed.
o_i <- table(modcloth$fit)
o_i
Just out of interest, let's see what our sample proportions are.
sample_props <- counts_table / nrow(modcloth)
sample_props
When you're working on an algorithm that will recommend clothes, of course we would want all the clothes to fit all the customer we're recommending to! However, this can't always be the case. One size does not fit all!
Say that Modcloth wanted to maintain a probability distribution of:
$P(\text{Fit}) = 0.75$
$P(\text{Large}) = 0.125$
$P(\text{Small}) = 0.125$
Therefore, we'd have the hypotheses:
$H_0:$ Our counts data follow the prespecified distribution where $P(\text{Fit}) = 0.75$, $P(\text{Large}) = 0.125$, and $P(\text{Small}) = 0.75$.
$H_1:$ At least one of our counts do not follow the distribution.
Now, we'll start our $\chi^2$ test for goodness-of-fit.
The assumptions for a goodness-of-fit test is that we have an independent random sample (we'll assume this), and at least 80% of our cells have $\geq 5$ counts in them.
Now, we would like to create our $\chi^2$ test statistic $\sum_{i=1}^{k} \frac{[O_i-E_i]^2}{E_i}$.
We need to calculate $E_i$. We will calculate $E_i$ by multiplying each of the probabilities from the probability distribution times our sample size $n$.
n <- nrow(modcloth)
dist <- c(fit=0.75, large=0.125, small=0.125)
e_i <- n*dist
e_i
Now that we have our $E_i$, we can do our subtraction, squaring, division, and then finally sum up all our terms. I'll show this step-by-step. You only need to really do the last step.
o_i-e_i
(o_i-e_i)^2
(o_i-e_i)^2 / e_i
# * THIS CALCULATES THE FULL TEST STATISTIC
test_statistic_1 <- sum((o_i-e_i)^2 / e_i)
test_statistic_1
We'll now want to calculate the p-value using pchisq(test_statistic, degrees_of_freedom, lower.tail=FALSE)
. You will replace the values as fits your problem. For our problem,
lower.tail=FALSE
.pchisq(q=test_statistic_1, df=2, lower.tail=FALSE)
That is a super small p-value!
R
function.¶We can also use the following R
function to verify our answer.
chisq.test(x=o_i,
p=c(0.75, 0.125, 0.125))
We got the same test statistic and degrees of freedom. Perfect! Now, it's up to you to interpret our p-value and conclude our statistical test. Do we reject or fail to reject the notion that our counts come from the prespecified distribution?
Imagine now that we're interested in seeing how our size recommendation system works for tops and for bottoms. We are now asking the question are the distribution of clothing fits different or the same depending on which article of clothing the system recommends?
Therefore, our hypotheses of interest are
$H_0$: Clothing fit and article type are independent (or: the distribution of clothing fit for bottoms is the same as that of tops.)
$H_1$: They are dependent.
We already have our data loaded in, so what's next is that I'm going to filter
some data just to remove the NA
values, then make a table for our $O_i$.
modcloth <- modcloth %>% filter(!is.na(fit)) %>% filter(!is.na(category))
o_i <- o_ip <- table(modcloth$fit, modcloth$category)
o_i
The table setup above looks familiar, right? We've been working with tables like these to calculate conditional probabilities. Let's make a dataframe that we can work with.
In the following steps, I am reformatting our o_i
so we can use it conveniently in R
. I am going to overwrite our original o_i
. I am adding row names for our learning convenience (apparently, tibble
s don't need row names anymore? I think they're helpful for a lot though!)
o_i <- tibble(bottoms = c(10660, 2058, 2548),
tops = c(13498, 4168, 2698))
row.names(o_i) <- c("fit", "large", "small")
o_i
Before we get going, let's see if our distributions already look pretty different on a graph (so that we won't be surprised by our test results).
library(ggplot2)
o_ip <- data.frame(o_ip)
o_ip <- o_ip %>% rename(fit=Var1, article=Var2)
o_ip <- o_ip %>% mutate(relative_freq=Freq/rep(c(10660+2058+2548, 13498+4168+2698), each=3))
ggplot(o_ip, aes(x=fit, y=relative_freq, fill=article)) +
geom_bar(stat="identity", position="dodge") +
ylab("relative frequency") +
theme_minimal()
We see now that there is somewhat of a difference for bottoms/tops when it comes to large and small fits. However, the fit column seems to be quite similar for both article clothing types.
Next, I'll reformat o_i
to fit our coding needs. Here, I'm adding a column for the row totals.
o_i <- cbind(o_i, totals=rowSums(o_i))
o_i
Lastly, I'm adding in my column totals!
o_i <- rbind(o_i, totals=colSums(o_i))
o_i
In order to calculate our $E_i$, we use the formula
$$ E_i = \frac{[\text{Row total}]*[\text{Column total}]}{N} $$where $N$ is the grand total counts. There are other ways to do this, but this is easiest. Let's do the value for the first cell "by hand" in the chunk below.
fit_bottoms <- (15266*24158) / 35360
fit_bottoms
Let's do it also for small tops.
small_tops <- (5246*20364) / 35360
small_tops
The rest I'll do using code. (For you, continue using the formula above.)
e_i <- sapply(1:2, function(x) o_i[4,x]*o_i[1:3,3] / 35360)
e_i
And there are our expected values!
I'm going to first cut down o_i
to its original size. We're done using the row and column totals.
o_i <- o_i[1:3, 1:2]
o_i
Here's our test statistic.
# * THIS CALCULATES THE FULL TEST STATISTIC
test_statistic_2 <- sum((o_i-e_i)^2 / e_i)
test_statistic_2
Here's our p-value!
pchisq(q=test_statistic_2, df=(nrow(o_i)-1)*(ncol(o_i)-1), lower.tail=FALSE)
R
function.¶Our test statistic is very close to what was given from the below R
function output.
chisq.test(x=o_i, correct=FALSE)
Now, it's up to you to interpret and conclude!
Plenty of times in stats, some of our problems reduce into simpler ones. So is true for the story of $\chi^2$ and the z-test for 2 proportions.
Let's start with just the following data. (It's a subset from earlier.)
unfit <- o_i[2:3,]
unfit <- cbind(unfit, totals=rowSums(unfit))
unfit <- rbind(unfit, totals=colSums(unfit))
row.names(unfit) <- c("large", "small", "totals")
unfit
Imagine we are interested in testing the following hypotheses:
$H_0$: The distribution of clothing fit is the same for each article of clothing.
$H_1$: The distribution of clothing fit is not the same for each article of clothing.
This is a $\chi^2$ test for independence.
chisq.test(o_i[2:3,],)
From the above, we got a $\chi^2$ test statistic of $284.57 \approx 285$. Our p-value is very small. We would reach the conclusion of rejecting $H_0$. (Be able to interpret this in context!)
But what we can also do is take sample proportions and test those to be equal. The above $\chi^2$ test is equivalent to a 2 proportion z-test for $P(\text{Large}|\text{Tops})$ and $P(\text{Large}|\text{Bottoms})$. We don't have to test for $P(\text{Small}|\text{Tops})$ and $(\text{Large}|\text{Bottoms})$ because we can calculate these values using the complement rule.
$H_0$: $P(\text{Large}|\text{Tops}) = P(\text{Large}|\text{Bottoms})$
$H_1$: $P(\text{Large}|\text{Tops}) \neq P(\text{Large}|\text{Bottoms})$
Again, this is a 2 sample z-test for proportions.
p_large_tops <- 4168/6866
p_large_bots <- 2058/4606
p_large <- (4168+2058) / (6866+4606)
c(prop_large_tops = p_large_tops,
prop_large_bottoms = p_large_bots,
prop_large = p_large)
Off the bat, these proportions don't look quite the same. Let's see if statistical tests follow this thought.
numerator <- p_large_tops-p_large_bots
denominator <- sqrt(p_large*(1-p_large)*((1/6866)+(1/4606)))
z_statistic <- numerator / denominator
z_statistic
We can now calculate our p-value using pnorm()
. Since it's a two-sided hypothesis test, I'm multiplying by 2.
2*(1-pnorm(q=z_statistic, mean=0, sd=1))
With a small p-value like that, we're going to follow the $\chi^2$ test's suit and reject $H_0$. So, while the p-values for the tests are possibly different, the two tests bring us to the same conclusion. We have evidence to believe the (1) distributions are not the same, which can also be restated as, (2) $P(\text{Large}|\text{Tops}) \neq P(\text{Large}|\text{Bottoms})$.
Furthermore, while this may seem like a pop fact of sorts, the $\chi^2$ test statistic is simply the z-test statistic squared.
z_statistic^2
Recall, the $\chi^2$ test statistic $\approx 285$, and $z^2 \approx 285$ as well. (There is some rounding error involved here.) This comes from some theory! The $\chi^2$ distribution is mathemagically defined as the square of a standard normal distribution $N(0,1)$.