PH142
October 2018
Please also consult lecture notes and your text. Many of you have been asking for sample problems. For understanding the material, textbook review exercises are great.
Use Piazza for all your statistical questions!
Good luck studying to all!
Room Info:
Please bring:
1. Your ID
2. Calculator
3. Cheat sheet
4. Watch
Chapter 6, 7: Make your sample representative of your population!
Chapter 9, 10: Know when to apply your probability rules!
Chapter 11: Continuous Distribution, Normal
Chapter 12: Discrete Distributions, Binomial and Poisson
Chapter 13: Sampling Distribution, the distribution of your sample statistics
Chapters 6 and 7 cover the main idea of how to get the best estimates of your population from a sample. What you need to take from both chapters most is what constitutes a good statistical setup for getting the estimates you care about. Understanding where the following terms fit into a study is key.
Statistical inference: the process of drawing conclusions about a population based on a sample.
Bias: when the expected value of an estimate based on a sample differs from the true underlying parameter value.
Confounding: the association between an exposure and an outcome is confounded if there exist one or more variables that are causes of the outcome and are also associated with the exposure of interest.
Take Home Questions:
Factor: an explanatory variable that is being manipulated. There can be more than one factor.
Treatment: a specific experimental condition. When there is more than one factor, a treatment is a combination of specific values of each factor.
Placebo effect: a response to a fake treatment because a person expects the treatment to be helpful.
Chapter 9 and 10 are dense with formulas. Since you will most likely be putting these on your cheat sheet, your task is to be able to read word problems and apply the appropriate rules based on what the question is asking.
Look for patterns in language. Some words to look for are “given”, “and”, “or”, “both”, “neither”. Can you name more? Read some of your homework problems and see which words relate to which probability.
Random samples eliminate bias from the act of choosing a sample, but they can still be wrong
This is because of the variability that results when we choose at random.
If the variation when we take repeat samples from the same population is too great (really big), we can't trust the results of any one sample.
A probability model is a math description of a random phenomenon consisting of two parts: a sample space \( S \) and a way of assigning probabilities to events.
Remember doing this in your homework?
Rule 1. The probability P(A) of any event satisfies \( 0\leq P(A)\leq 1 \).
Rule 2. If S is the sample space in a probability model, then P(S) = 1.
Rule 3. Two events A and B are disjoint (mutually exclusive) if they have no outcomes in common and so can never occur together.
If A and B are disjoint, \( P(A \cup B) = P(A) + P(B) \). This is the addition rule for disjoint events.
Rule 4. The Complement Rule. For any event A,
\( P(A^C) = 1 - P(A) \)
where \( P(A^C) \) is the probability that A does not occur.
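For example, if \( P(A) = 0.3 \), then \( P(A^C) = 1 - 0.3 = 0.7 \).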
Take Home Question: Where does Rule 4 appear when we're using distributions?
A probability model with a sample space made up of a list of individual outcomes is called discrete.
Example
e.g., Birth type: S = {vaginal, cesarean}
e.g., Daily soda consumption: S = {0, 1, 2, 3, 4+}
Relate this back to what you know about discrete data.
The sample space is now an entire interval of numbers:
S = {all numbers between 0 and 1}
Example
e.g., Annual income: S = {0 to \( \infty \)}
Relate this back to what you know about continuous data.
Question: Does the normal distribution have a density curve? Does the binomial?
The sample_n function can be used in this way. The iris dataset is built into R!
library(dplyr)
sample_2 <- sample_n(iris, size = 2)
sample_2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
92 6.1 3 4.6 1.4 versicolor
113 6.8 3 5.5 2.1 virginica
The same function works just as easily for a larger sample size:
sample_6 <- sample_n(iris, size = 6)
sample_6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
87 6.7 3.1 4.7 1.5 versicolor
118 7.7 3.8 6.7 2.2 virginica
51 7.0 3.2 4.7 1.4 versicolor
89 5.6 3.0 4.1 1.3 versicolor
140 6.9 3.1 5.4 2.1 virginica
84 6.0 2.7 5.1 1.6 versicolor
Now, we're sliding into Chapter 10 material.
See Sarah's slides on bCourses for Venn Diagram help. Visualizing with Venn Diagrams will change your life for Chapter 10 material.
Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs. If A and B are independent,
\( P(A \cap B) = P(A)\times P(B) \)
Conversely, if this condition is not satisfied, then events A and B are dependent.
When P(A) > 0, the conditional probability of B, given A, is
\( P(B|A)=\frac{P(A \cap B)}{P(A)} \)
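For example, with made-up numbers: if \( P(A) = 0.5 \) and \( P(A \cap B) = 0.2 \), then \( P(B|A) = 0.2/0.5 = 0.4 \).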
For any two events A and B, \( P(A \cup B) = P(A) + P(B) - P(A \cap B) \).
This formula simplifies to \( P(A \cup B) = P(A) + P(B) \) when A and B are disjoint.
The probability that both of two events A and B happen together can be found with the multiplication rule, covered below.
Hearing impairment in dalmatians: Congenital sensorineural deafness is the most common form of deafness in dogs and is often associated with congenital pigmentation deficiencies.
A study of hearing impairment in dogs examined over 5,000 dalmatians for both hearing impairment and iris color. Being impaired was defined as deafness in one or both ears. Dogs with one or both irises blue (a trait due to low iris pigmentation) were labeled blue.
The study found that 28% of the dalmatians were hearing impaired, 11% were blue eyed, and 5% were hearing impaired and blue eyed.
Question: Write the information from the prompt in probability notation.
Solution
Let I = hearing impaired and B = blue eyed. Then:
\( P(I) = 0.28 \)
\( P(B) = 0.11 \)
\( P(B \cap I) = 0.05 \)
Question: What is the probability that a dalmatian is either blue eyed or hearing impaired?
Solution
\( P(B \cup I) = P(B) + P(I) - P(B\cap I) = 0.11 + 0.28 - 0.05 = 0.34 \)
Additional Questions for Home:
If this was difficult to answer, please refer to Sarah's Venn Diagram notes on bCourses.
\( P(A \cap B) = P(A)P(B|A) \). This is simply a rearrangement of the conditional probability formula.
This simplifies to \( P(A \cap B) = P(A) \times P(B) \) when A and B are independent events.
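As a quick check using the dalmatian figures above: \( P(B) \times P(I) = 0.11 \times 0.28 = 0.0308 \), which is not equal to \( P(B \cap I) = 0.05 \), so blue eyes and hearing impairment are not independent in that study.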
\( P(A|B)=P(A) \) if independent. But if they aren't…
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)} \]
This is Bayes' rule: the numerator \( P(A \cap B) \) can be rewritten as \( P(B|A)P(A) \) using the multiplication rule, and the denominator \( P(B) \) is expanded with the law of total probability.
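Here's a minimal R sketch of Bayes' rule with made-up numbers (a hypothetical screening test, not an example from the text), just to show how the pieces of the formula map to code:
p_A <- 0.01            # P(A): prevalence of a condition (made up)
p_B_given_A <- 0.95    # P(B|A): positive test given the condition (made up)
p_B_given_notA <- 0.05 # P(B|A^c): positive test without the condition (made up)
p_B <- p_B_given_A*p_A + p_B_given_notA*(1 - p_A)  # denominator: total probability of B
p_B_given_A*p_A / p_B                              # Bayes' rule: P(A|B)
[1] 0.1610169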
A common question has been WHEN DO WE USE THESE DISTRIBUTIONS!? Well, sometimes our data tell a story that we see over and over again.
We can think of distributions as things that model patterns in data. If we see data that looks approximately normal or follows a recipe for the Binomial or Poisson(*), then we should definitely consider using the appropriate distribution. When we use a distribution like one of the named three above, we can end up saying a lot about our data. We can say the mean, the variance/standard deviation, and calculate probabilities.
(*) See the following Chapter 11 and 12 notes for these recipes.
What is the difference between normal and standard normal?
According to basketball-reference.net, the mean height for NBA players is 6'7" = 79 inches. Suppose it is known that player height is normally distributed with a standard deviation of 4 inches. What is the probability that a player is shorter than 6 feet?
The prompt information translates to:
mu <- 79   # mean height in inches
sd <- 4    # standard deviation in inches
k <- 6*12  # 6 feet converted to inches
We can calculate this probability one way (without tables!).
pnorm(q=k, mean=mu, sd=sd)
[1] 0.04005916
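We get the same answer if we standardize first: convert to a z-score and use the standard normal (mean 0, sd 1). This is exactly the link between “normal” and “standard normal” asked about above.
z <- (k - mu)/sd
pnorm(q=z)   # mean=0 and sd=1 are the defaults
[1] 0.04005916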
All the [ ]norm functions are fair game!
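For reference, here's a minimal sketch of the other [ ]norm functions, reusing mu and sd from the block above (assuming they are still in memory); outputs are omitted since rnorm is random:
qnorm(p=0.5, mean=mu, sd=sd)   # the height at the 50th percentile (the median)
dnorm(x=79, mean=mu, sd=sd)    # the height of the density curve at 79 inches
rnorm(n=3, mean=mu, sd=sd)     # three simulated player heights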
Refer to your homework assignments for practice problems!
This is the formula for the Binomial distribution.
\[ P(X=k) = \binom{n}{k}p^{k}(1-p)^{n-k} \]
Since you'll probably have this formula on your cheat sheet, what's more important is understanding what each piece of the binomial formula represents: \( n \) is the number of trials, \( k \) is the number of successes, and \( p \) is the probability of success on each trial.
As we said earlier, distributions are used to calculate means, variances, and probabilities of situations we see often! Think back to the Korean drama or pop song. They have well-known structures. So does the Binomial setting. Check if your data fit the Binomial setting.
Recipe for Binomial
1. Goal: we want to find a probability; any of these are fine: less than / equal to / greater than / a combination
2. You have a fixed number of trials
3. Trials are independent
4. The probability of success is the same for each trial
5. There are more assumptions, look for these at home.
When can we approximate the binomial as a normal distribution? Why?
A shopper goes online Thursday mornings to attempt to purchase Supreme apparel and accessories. Due to the brand's popularity and limited supply, the shopper successfully purchases an item only 1 time in 10. What is the probability that the shopper will make 4 successful purchases in 10 independent purchase attempts? How about greater than 4?
The prompt information translates to:
n_trials <- 10
k_success <- 4
probability <- 1/10
And we can calculate the first quantity “by hand” using the formula:
choose(n_trials, k_success)*(probability)^k_success*(1-probability)^(n_trials-k_success)
[1] 0.01116026
Or we can use a special function:
dbinom(x=k_success, size=n_trials, prob=probability)
[1] 0.01116026
The second quantity (the probability of greater than 4) can be found by using the Complement Rule.
1 - choose(n_trials, 4)*(probability)^(4)*(1-probability)^(n_trials-4) -
  choose(n_trials, 3)*(probability)^(3)*(1-probability)^(n_trials-3) -
  choose(n_trials, 2)*(probability)^(2)*(1-probability)^(n_trials-2) -
  choose(n_trials, 1)*(probability)^(1)*(1-probability)^(n_trials-1) -
  choose(n_trials, 0)*(probability)^(0)*(1-probability)^(n_trials-0)
[1] 0.001634937
Or we can use another cool function that calculates the sum of probabilities from 0 to k.
1-pbinom(q=k_success, size=n_trials, prob=probability)
[1] 0.001634937
Or even more conveniently:
pbinom(q=k_success, size=n_trials, prob=probability, lower.tail=FALSE)
[1] 0.001634937
All the [ ]binom functions are fair game!
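Similarly, here's a quick sketch of the remaining [ ]binom functions, reusing n_trials and probability from above (outputs omitted since rbinom is random):
qbinom(p=0.5, size=n_trials, prob=probability)   # the median number of successes
rbinom(n=5, size=n_trials, prob=probability)     # five simulated runs of 10 attempts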
This is the formula for the Poisson distribution.
\[ P(X=k) = e^{-\lambda}\frac{\lambda^k}{k!} \]
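As a sanity check on the formula, with an arbitrary rate of \( \lambda = 2 \) and \( k = 3 \) (numbers chosen just for illustration), the by-hand calculation matches dpois:
lambda <- 2
k <- 3
exp(-lambda)*lambda^k/factorial(k)
[1] 0.180447
dpois(x=k, lambda=lambda)
[1] 0.180447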
Recipe for Poisson
1. Goal: we want to find a probability; any of these are fine: greater than / less than / equal to / a combination
2. Events occur independently.
3. The rate at which events occur is constant. The rate cannot be higher in some intervals and lower in other intervals.
4. There are more assumptions, look for these at home.
San Francisco is known for its fog. In fact, the fog's name is Karl! We expect a warm, sunny, Karl-free day in San Francisco about twice per month.
Do this at home: What is the probability that we see more than 5 days of sun in San Francisco in a month? How about exactly 10?
A sampling distribution is the distribution of estimates that we have for our parameter. In our case, we make distributions of estimates for either the population mean \( \mu \) or the population proportion \( p \).
Here's a motivating figure from the text. After you're done looking over this section, reflect on this figure and see if you can understand it all.
As \( n \) gets huge, our sampling distribution's mean will get closer and closer to the true population parameter value.
Let's take a look at some NBA data that was scraped off of basketball-reference.com. We'll read in the csv.
western_conference <- read.csv("western_nba.csv")[,-1]
head(western_conference, 3)
Player millions team
1 Stephen Curry 37.45715 Golden State Warriors
2 Russell Westbrook 35.65415 Oklahoma City Thunder
3 Chris Paul 35.65415 Houston Rockets
mean(western_conference$millions)
[1] 6.500936
In the following blocks, we're going to take 100 samples of size \( n=2, 5, 30 \). Ignore any code that you didn't learn for midterm 1. Focus on the output.
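The helper function sample_n_salaries() used below was defined in a hidden code chunk. Here's a plausible sketch of what it does (an assumption, not the original code): draw n players at random from western_conference and return the mean of their salaries.
sample_n_salaries <- function(n) {
  sample_n(western_conference, size=n) %>%   # draw n players at random
    summarize(mean_salary = mean(millions))  # return their average salary
}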
sample_means_2 <- do.call(rbind, lapply(1:100, function(x) sample_n_salaries(2)))
sample_means_2 %>% summarize(sampling_mean=mean(mean_salary))
sampling_mean
1 6.680716
sample_means_5 <- do.call(rbind, lapply(1:100, function(x) sample_n_salaries(5)))
sample_means_5 %>% summarize(sampling_mean=mean(mean_salary))
sampling_mean
1 6.512029
sample_means_30 <- do.call(rbind, lapply(1:100, function(x) sample_n_salaries(30)))
sample_means_30 %>% summarize(sampling_mean=mean(mean_salary))
sampling_mean
1 6.517297
As \( n \) gets huge, our sampling distribution itself will look more and more like a normal distribution.
Recall the data from earlier.
head(western_conference, 5)
Player millions team
1 Stephen Curry 37.45715 Golden State Warriors
2 Russell Westbrook 35.65415 Oklahoma City Thunder
3 Chris Paul 35.65415 Houston Rockets
4 LeBron James 35.65415 Los Angeles Lakers
5 Paul George 30.56070 Oklahoma City Thunder
This is the distribution of the population salaries.
What is the mean parameter \( \mu \) of the population? We can calculate this by hand or by a function. Focus on the by hand calculation.
Here's the “by hand” calculation. The following functions may seem new to you, but think intuitively instead of code-wise: we have data on all of the salaries, we add all the salaries up, and then we divide by the number of salaries to get the average.
vector_of_salaries <- western_conference$millions
sum(vector_of_salaries) / length(vector_of_salaries)
[1] 6.500936
Here's the dplyr solution.
library(dplyr)
western_conference %>% summarize(mu = mean(millions))
mu
1 6.500936
Let's see what our sampling distribution looks like for several choices of \( n \).
library(ggplot2)
ggplot(sample_means_2, aes(x=mean_salary)) +
  geom_histogram(binwidth=0.7) +
  scale_x_continuous(limits = c(0, 40))
ggplot(sample_means_5, aes(x=mean_salary)) +
geom_histogram(binwidth=0.5) +
scale_x_continuous(limits = c(0, 40))
Visually, the sample size of \( n=30 \) is the most symmetric.
ggplot(sample_means_30, aes(x=mean_salary)) +
geom_histogram(binwidth=0.5) +
scale_x_continuous(limits = c(0, 40))
As \( n \) gets larger, then we see the sampling distribution get less skewed and more like the normal curve.
Sampling distributions have means and variances that can be calculated as specified below!
For the sample mean \( \bar{x} \): mean \( \mu \), variance \( \sigma^2/n \).
For the sample proportion \( \hat{p} \): mean \( p \), variance \( p(1-p)/n \).
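For example, with made-up numbers: if \( p = 0.5 \) and \( n = 100 \), the sampling distribution of \( \hat{p} \) has mean 0.5 and variance \( 0.5 \times 0.5/100 = 0.0025 \), so its standard deviation is \( \sqrt{0.0025} = 0.05 \).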
Question: What is the relationship between variance and standard deviation?
Question for Home: What assumptions go along with sampling distributions and their normalities? Check the chapter.
Extra Example 1 We expect there to be only 2 BART delays per week because BART is so efficient and amazing and uses its few-rail system wisely. What's the probability of there being 2 delays in a day?
?dpois
?ppois
Extra Example 2 Carl Gauss in the house! Carl Gauss was a very smart man. He made discoveries on the weekly. Assume that the number of discoveries he made per week can be modeled by the Gaussian distribution \( N(\mu=5,\sigma=7) \).
Extra Example 3 Check out the NBA data!
Team Freq
1 Golden State Warriors 20
2 Los Angeles Clippers 20
3 Los Angeles Lakers 16
4 Sacramento Kings 21
Pool all the California NBA players together. If we take a random sample of size 12 of these huge humans, what is the probability that we choose exactly 5 of them from the Warriors?
True or False? The normal distribution is a discrete distribution.
True or False? The mean of a Poisson distribution is the variance of the Poisson distribution.
True or False? Sample proportion \( \hat p \) is an unbiased estimator for population parameter \( p \).
True or False? A binomial distribution can have three outcomes.