Please contribute if you see any errors. Thanks.

Problem 1.

Part (a)

This test does not require distributional assumptions. It uses ranks and is comparable to a paired t-test.

\(\bigcirc\) Mann-Whitney Test

ANSWER Wilxocon Rank Sign Test

\(\bigcirc\) Bootstrap

\(\bigcirc\) Kruskal-Wallis Test

Part (b)

This method allows us to create confidence intervals based off of a simulated sampling distribution.

\(\bigcirc\) t-Test

\(\bigcirc\) Wilxocon Rank Sum Test

ANSWER Bootstrap

\(\bigcirc\) Permutation Test

Part (c)

This method uses relabels to simulated the distribution under the null hypothesis.

ANSWER Permutation Test

\(\bigcirc\) Kruskal-Wallis Test

\(\bigcirc\) Wilxocon Rank Sign Test

\(\bigcirc\) z-test

Part (d)

This nonparametric test is comparable to an ANOVA F-test.

\(\bigcirc\) Permutation Test

ANSWER Kruskal-Wallis Test

\(\bigcirc\) Wilxocon Rank Sign Test

\(\bigcirc\) z-test

Problem 2.

The following problem cites this study (Mithoefer et al). You can learn more about the study’s data here.

The 2018 paper by Mithoefer et al studied the effect of MDMA (3,4-methylenedioxymethamphetamine) on military veterans, firefighters, and police officers with post-traumatic stress disorder. Each of the subjects (n=26) were given different doses of MDMA and a psychotherapy session, then had their Clinician-Administered PTSD scale score was measured.

Low dose MDMA Medium dose MDMA High dose MDMA
n=7 n=7 n=12
30 mg 75 mg 125 mg

The study wants to test their hypotheses about whether these 3 groups had the same mean Clinician-Administered PTSD scale score.

Here is another table for more information.

30 mg 75 mg 125 mg
Mean PTSD duration 68.9 58.3 110.9

Part (a)

The following is the start of a DAG for this study.

1. What is \(A\) in the study?
MDMA and psychotherapy.

2. What is \(Y\) in the study?
Clinician-Administrered PTSD scale score.

3. What kind of study is this?
This is an experiment / randomized controlled trial. The study collected its data in a prospective manner. We want to know if MDMA and psychotherapy causes an improved PTSD scale score (causative).

4. What test should this study use to test their hypotheses?
An ANOVA F test.

Part (b)

Can you tell if the assumptions for this test completely satisfied? Choose an answer, then elaborate in the space below the answer.

\(\bigcirc\) Yes

ANSWER No

We cannot tell if the assumptions are completely satisfied. We have no information on the distributions of the PTSD scores’ normality or if their standard deviations are similar between all three groups.

Part (c)

We have the following table from the paper about group composition.

30 mg 75 mg 125 mg
Men 5 6 8
Women 2 1 4

Use a formal testing procedure to test whether the distribution of men and the distribution of women amongst the groups were the same. Follow the steps.

Is this a \(\chi^2\) goodness-of-fit test or a \(\chi^2\) independence test?

\(\bigcirc\) Goodness-of-fit

ANSWER Independence

Calculate the \(\chi^2\) test statistic.

Here are our observed values.

## # A tibble: 3 x 2
##     men women
##   <dbl> <dbl>
## 1     5     2
## 2     6     1
## 3     8     4

We calculate the totals.

##   men women totals
## 1   5     2      7
## 2   6     1      7
## 3   8     4     12
## 4  19     7     26

Now, calculate the expected values based on \(RC/N\) where \(R\) is the row total corresponding to the given cell, \(C\) is the column total corresponding to the given cell, and \(N\) is the full total.

##        men    women
## 1 5.115385 1.884615
## 2 5.115385 1.884615
## 3 8.769231 3.230769

Notice that I did not round.

Then, we build the \(\chi^2\) statistic by using \(\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}\) where \(k\) is number of cells.

##          men       women
## 1 0.00260266 0.007064364
## 2 0.15297860 0.415227630
## 3 0.06747638 0.183150183

We have to sum everything up now.

stat <- sum(contributions)
stat
## [1] 0.8284998

What are the degrees of freedom for this test?
\((R-1) \cdot (C-1) = 2\).

Shade in your p-value.
Draw some chi-squared distribution. (It doesn’t necessarily have to look like this one perfectly.) Fill in the curve beyond the test statistic (to the right of). I didn’t show the shade in.

You would get credit if your answer looked like this as well. Remember, you’d have to shade in all to the right of the vertical line.

Write code to calculate your p-value.

pchisq(stat, df=(3-1)*(2-1))
## [1] 0.3391642

Part (d)

Recall the study used an ANOVA test to find results. After the ANOVA test, they conducted pairwise t-tests afterward. Why would they do that?
To understand the groups’ pairwise difference / to see which group contributed to rejecting \(H_0\).

Part (e)

After conducting corrected pairwise t-tests, the study found results that the (augmented) high dose 125 mg had significantly higher Clinician-Administered PTSD scale score results than the low dose group with a p-value of 0.01. Interpret this p-value. Is it statistically significant? What would you conclude?
Yes, at some significance level greater than 0.01. I would conclude that there is a difference in means PTSD scale score between those two said groups.

Part (f)

And here is another table from the study. Do you feel comfortable generalizing this study to all with PTSD?

30 mg 75 mg 125 mg
White 6 4 12
Latino 1 1 0
Native American 0 1 0
Native American & white 0 1 0

Not at all.

Part (g)

Which Disney character does the DAG in Part (a) look like?
Baymax. Come on, give me some harder questions.

Problem 3.

Let’s throw it back to 2002.

The Warriors played and won in the 2018 playoffs (reread this sentence a few times over if you’ve not yet understood this). The following contingency table contains information about Steph Curry.

Stephen Curry is a basketball player on the Golden State Warriors (GSW) well-known for his impressive 3 point shot. During the 2018 playoffs, Curry’s performance was jawdropping. Thus, we wanted to set up the following table to inspect some stats. This table contains counts of points corresponding to the last 4 playoff games.

Points made by Points scored by 3pt shot All other points scored Totals
Stephen Curry 66 44 110
All other Warriors 87 y 354
Totals 153 z x

Part (a)

Based on the above and per your calculations, how many points total did GSW score during these 4 games?

110+354
## [1] 464

Part (b)

What is the probability that Stephen Curry scored points from a 3 point shot?
This is \(P(\text{3 point shot}|\text{Steph})\).

n_steph     <- 110
n_steph_3pt <- 66

n_steph_3pt / n_steph
## [1] 0.6

Part (c)

What is the probability of all other Warriors scoring points from a 3 point shot?

87/354
## [1] 0.2457627

Part (d)

Which probability does the value of \(y/z\) from the table represent?

ANSWER \(\, \, \text{A.} \, \,\) Points were made by all other Warriors given the points were not scored from 3 point shots

\(\bigcirc \, \, \text{B.} \, \,\) Points scored were not from 3 point shots and the points were scored by all other Warriors

\(\bigcirc \, \, \text{C.} \, \,\) Points scored were not from 3 point shots given the points were scored by all other Warriors

Problem 4.

Arya and Brienne are both awesome warriors. They took upon themselves to take on students. As a competition, they tested their students and gave them a battle competency grade (out of 100).

This is a preview of the dataset.

variable value
brienne 80.77851
arya 56.30703
arya 74.11349
arya 55.52089
brienne 96.76028
brienne 83.20172
brienne 85.82450

Part (a)

What kind of plot is this?

Histogram or overlapping histogram. (It’s not dodged :P).

Part (b)

What kind of plot is this?

Dodged boxplot.

We are interested in testing whether Arya’s students or Brienne’s students are more combat competent.

Part (c)

Write the relevant hypotheses.
\(H_0\): The mean combat competency scores for Arya’s students and Brienne’s students are the same.
\(H_0\): The mean combat competency scores for Arya’s students and Brienne’s students are not the same.

Part (d)

Which tests can we NOT use to test these hypotheses?

ANSWER Two sample z-test
We don’t have the population standard deviation.

\(\square\) Two sample t-test

ANSWER Paired t-test
We have no info on matched pairs.

\(\square\) Permutation Test

\(\square\) Wilxocon Rank Sum Test

ANSWER Wilxocon Rank Sign Test
We have no info on matched pairs.

Part (e)

We are going to use a permutation test to test these data. Do we have distributional assumptions to satisfy for this test?
Nope. Nonparametrics require no distributional assumptions.

Part (f)

Permutation tests require you to reshuffle or permute labels on your dataset to simulate a null (a distribution where the null is true) distribution. To make statistical conclusions based on this dataset, we calculate the p-value which can be interpreted as the probability of observing my data or more extreme given that the null is true.

Part (g)

What is our observed difference between the groups?

## # A tibble: 2 x 2
##   variable `mean(value)`
##   <fct>            <dbl>
## 1 arya              69.1
## 2 brienne           90.6
91.42876-68.33121
## [1] 23.09755

Technically, this can also be written the other way, i.e. subtract Brienne from Arya.

Based on this simulated distribution, what do you expect our p-value to be?

We expect it to be near 0.

ggplot(data.frame(calculated_diffs), aes(x=calculated_diffs)) + 
  geom_histogram(binwidth=0.7, fill="gold2", col="black", lwd=0.2) +
  xlab("simulated distribution") +
  theme_minimal() +
  geom_vline(xintercept=obs_diff, lwd=3, alpha=0.7)

Problem 5.

Part (a)

Why would you use a bootstrap confidence interval instead of a normal-based confidence interval for testing the following hypotheses for any arbitrary dataset?

\[ H_0: m=100 \]

versus

\[ H_1: m \neq 100 \]

where \(m\) is the true median of our population.

Part (b)

Which of the following distributions do you simulate/estimate in order to create a bootstrap confidence interval?

\(\bigcirc\) Distribution under the \(H_0\)

\(\bigcirc\) Distribution under the alternative hypothesis

ANSWER Sampling distribution of \(m\)

\(\bigcirc\) Normal curve

Part (c)

Which of the functions is used when finding the upper and lower bound for a bootstrap confidence interval?

ANSWER quantile()

\(\bigcirc\) bootstrap()

\(\bigcirc\) aov()

\(\bigcirc\) summary()

Problem 6.

Figure out the type of test you’re going to use based on the problem. Then, carry out the test.

Part (a)

We have data on the birth years of presidents and their first people. Here is a random sample of that data.

president first_person
1924 1925
1882 1884
1946 1947
1843 1847
1890 1896
1917 1929
1795 1803
1857 1861
1809 1818
1913 1912

What kind of variables are in these columns?
Year is like age to time (in this case). Time is continuous, age is continuous, year is continuous (in this case).

We are interested in knowing whether the president tends to be older than their first person. What hypothesis might we test? It may be helpful to have the following output to answer the question.

dim(presidents)
## [1] 44  2

presidents <- presidents %>% mutate(new_col=president-first_person)
ggplot(presidents, aes(x=new_col)) + 
  geom_histogram(binwidth=1, col="black", fill="gold2", lwd=0.2) +
  theme_minimal()

We are interested in testing

\(H_0\): The differences between each president and that president’s first person is 0.
\(H_1\): The difference between each presdient at the president’s first person is greater than 0.

What tests can you use? (Hint: There’s more than one, but one will give you better results in this case.)

Because we have \(n=44\) for each of the groups, assume randomness, and have distributions of similar shapes, we’d opt for a one-sided paired t-test. The paired t-test will give you better results than a nonparametric test because we are able to use theoretical distributions as opposed to rankings (which lose information/power).

Part (b)

On Thursday mornings, an online shopper attempts to buy an item from the Supreme brand website. Say that probability of successfully purchasing an item on a given Thursday is 20%.

How many times per month do we expect the shopper to successfully order something?
Assuming all months have 4 Thursdays, then we would expect 0.8 successes per month. This is, in practical terms, 0 successes.

0.2*4
## [1] 0.8

What is the probability that the shopper will successfully order an item three Thursdays in a row?

0.2^3
## [1] 0.008

What is the probability that the shopper will successfully order 2 items during the span of three Thursdays?

choose(3,2)*(0.2)^3*(1-0.2)
## [1] 0.0192

Our shopper believes that it’s getting even more difficult to purchase an item off of the Supreme brand website lately. He and 30 of his friends decide to all try to purchase an item. Out of his group of friends and himself, 10 of them are able to purchase an item. Does the shopper have evidence that the probability of purchasing an item on a given Thursday is still 20%?

A status-quo belief is that \(P(\text{Success})=0.2\). A competing belief is that \(P(\text{Success})=10/31\). We want to test the following hypotheses.

\[ H_0: p=0.2 \]

versus

\[ H_0: p \neq 0.2. \]

Our \(\hat{p}=10/31\). Do we meet assumptions for the large sample z-test for a population proportion? We assume that we have randomness. We have 10 successes and 21 failures. Let’s try.

n <- 31
p_hat <- 10/31
n*p_hat
## [1] 10
n*(1-p_hat)
## [1] 21

Since these values are both \(\geq\) 10, and we assume that our population is at least 20 times larger than 31, we’ll go ahead and do a large sample z-test.

p_0 <- 0.2
se  <- sqrt(p_hat*(1-p_hat)/n)

z   <- (p_hat - p_0) / se
z
## [1] 1.460007

Then, we’ll get our p-value.

2*pnorm(1.46, lower.tail=FALSE)
## [1] 0.1442901
2*(1-pnorm(1.46))
## [1] 0.1442901

We interpret this p-value to mean if the true probability of success was really 0.2, then we would expect to see our data or more extreme 14.429% of the time just by random chance.

Because this value is not so small (it is greater than 10%, greater than 5%, greater than 1%), we fail to reject the null hypothesis!

To answer the question: No. There is no evidence.

Part (c)

We have conducted an ANOVA test between 5 different groups of friends and tested how quickly they could climb up a standard gym class rope. Our ANOVA test gave us a p-value of 0.0008.

How do we interpret the p-value?
Given that the mean rope times of the 5 different groups are truly the same, we would expect to see our observations or observations yielding more extreme results 0.08% of the time.

Can we tell which groups over/under-performed?
Not just by an ANOVA test. We’d need to do post-hoc procedures to figure this out.

Problem 7.

Here are a bunch of “pop” questions to test your knowledge.

- True or false. If two events are independent, then they are mutually exclusive.
False.

- What are the two types of categorical variables? Give examples.
Nominal and ordinal.

- Is age continuous or discrete?
Continuous.

- True or false. If my null hypothesized value of \(\mu\) is not within my confidence interval, then my p-value is less than my significance level \(\alpha\).
True. If \(\mu_0\) is not in our CI, then we are rejecting our null hypothesis. So, our p-value would be less \(\alpha\) because when our p-value is small, we reject our null.

- What do we always expect of our data in statistical inference?
To be pulled randomly.

- True or false. Correlation can be below 0.
True.

- Does correlation make a distinction between explanatory and response variables?
No.

- Under what conditions would we use the plus-four method?
When normality doesn’t kick-in. Short answer, \(n\) is 20x smaller than population, \(n\hat{p}\) and \(n(1-\hat{p})\) are both at least 10.

- In what ways can we increase the F statistic? (Theoretical is fine.)
- Decrease standard deviation within groups.
- Make the means of each of the groups further from each other.

(Not really an answer, but side note: to give our F statistic even more evidence, increase sample size.)

- Why do we learn 4 different methods to create CI for proportions? Which one would you use with small \(n\)?
Because our sample size and prooportions don’t always satisfy the large sample assumptions.

- Imagine we are doing a z-test for two proportions. What can be changed to make the confidence interval more narrow?
- Your confidence level should decrease.
- Sample size increases.
- If SE were smaller (though we can’t change this for real), then the CI would get smaller.

Problem 8.

Do the Bill Nye question on page 3 of this document.
Solutions are provided on the document.

Problem 9.

You are testing

\[ H_0: \mu = 10 \]

versus

\[ H_1: \mu < 10 \]

with data from an underlying normal distribution. Your data are as follows.

dim(data)
## [1] 25  1
data %>% summarize(mean=mean(data))
##       mean
## 1 9.078532

Are we calculating a z or t statistic?
t. We don’t have \(\sigma\).

What parameter(s) are relevant for the distribution that we plan to calcualte our p-value from? Give exact value(s).
Degrees of freedom, i.e. \(n-1\) for this case. That is \(25-1\).

Problem 10.

You plan to ask a random sample of San Franciscans their opinion on whether coffee is the best beverage.

How many San Franciscans must be interviewed to estimate the population proportion to within 3 percentage points with 95% confidence? Use 0.5 as the conservative guess for \(p\).
Recall

\[ n = \bigg(\frac{z^*}{m}\bigg)^2p^*(1-p^*) \]

In this case, \(z^*=2.96\), \(m=0.03\), and we are using \(p^*=0.5\). The “within” 3 percentage points mean that there would be 3% to each side of our estimate.

z_star <- 1.96
m      <- 0.03
p_star <- 0.5

(z_star/m)^2*p_star*(1-p_star)
## [1] 1067.111

Therefore, the answer is 1068. Don’t forget to round.

Would the sample size have to increase or decrease if we changed the desired margin of error from 3 percentage points to 2 percentage points? You do not have to calculate your answer.
The sample size would have to increase. Smaller margins of error call for larger samples. Notice that in the equation

\[ n = \bigg(\frac{z^*}{m}\bigg)^2p^*(1-p^*) \]

\(m\) is in the denominator. If we have \(m\) be smaller, then we have an overall larger number. Smaller denominators give greater overall values. (For example, 1/2 is bigger than 1/1000.)

What would the sample size be if we wanted the width of our interval to be 6 percentage points?
This has the same answer as the first problem. Within 3 percentages points is the same as saying the CI width is 6 percentage points (when dealing with symmetric CI’s about the estimate).

Problem 11.

What is statistical power? It’s the probability of rejecting the null hypothesis when, in fact, it is false. It is mathematically defined to be 1-\(\beta\), where \(\beta\) is the Type II error rate. (The probability of not making a Type II error.) If power is very close to 1, then it is very good at detecting a false null hypothesis.

How can you increase power? You can increase sample size, the effect size, or increase your \(\alpha\) (Type I Error rate).

It has been long recognized that there are only 2 elephant species: the Asian elephant and the African elephant. However, in recent years, scientists have concluded that African elephants have diverged into two different species. A wonderful scientific duo (and their 3 interesting children Debbie, Eliza, and Donnie) is driving their ComVee (communications vehicle) throughout Africa doing a documentary on the two African elephant species: the savanna elephants and forest elephants. By previous research, savanna elephant weights are normally distributed with a mean of 8,000 kg and a standard deviation of 1000kg. They have reason to believe that forest elephants weigh less. We believe that their average weight is at least about 1000 kg less.

Write the relevant hypotheses on the curiosity of forest elephant weight. \(H_0\): Forest elephants have a mean weight of 8,000 kg. \(H_1\): Forest elephants have a mean weight of less than 8,000 kg.

How many randomly sampled, independent elephants would need to be safely/ethically measured for your study if you wanted to have \(\alpha=0.05\) and \(\text{Power}=0.8\)?

Draw out two distributions. We have a set effect size based on our null belief of 8,000 kg and our alternative believe of 5,500 kg. We also have set values for power (0.8) and \(\alpha\). We want to calculate \(n\) in order to make power and \(\alpha\) “line up”. That is, if we have a certain sample size, then our distributions will change to allow for our power and \(\alpha\) to be of the size we desire. Recall that if you have a larger sample size, your sampling distribution shrinks in variance/standard deviation.

Our null distribution is to the right of our alternative distribution.

We can calculate the quantile corresponding to 20% of power on our normal alternative distribution by using

qnorm(0.8)
## [1] 0.8416212

and we can calculate the quantile corresponding to a 5% significance level \(\alpha\) by using

qnorm(0.05)
## [1] -1.644854

Now, we can use our formula to calculated sample size.

\[ n = \Big(\frac{(Z_{\alpha} + Z_{\beta}) * (\sigma)}{(\mu_{0} - \mu_{1})}\Big)^{2} \]

z_b   <- qnorm(0.8)
z_a   <- qnorm(0.05)
sigma <- 1000
num   <- (abs(z_a) + abs(z_b))*sigma
denom <- 7000-8000
(num/denom)^2
## [1] 6.182557

This says that we’d only need 7 elephants in our study. (Round up.) Why is this the case? Because the effect size is already so big. If this question was difficult, you had a question like this before (PS 8.1). Try to review that as well.