library(dplyr)
library(tidyverse)
When you have 2 categorical variables, you can summarize their counts in a contingency table.
` | Category 2A | Category 2B | Total |
---|---|---|---|
Category 1A | a | c | a+c |
Category 1B | b | d | b+d |
Total | a+b | c+d | a+b+c+d |
I want to give you a chance to think about stats that's for a quick second not in the context of public health. How about we look at famous rappers? I have already read in a dataset.
rappers_subset <- rappers %>% filter(origin %in% c("New-York", "California"))
head(rappers_subset)
While not all of the rappers in this dataset are from California or New York, a lot them were, so I decided to take a look at just California and New York rappers who are "tall" (above mean height) or not.
table(rappers_subset$origin, rappers_subset$tall)
Let's calculate some conditional probabilities. What is the probability that you're tall given that you're a famous rapper from New York?
table(rappers_subset$origin, rappers_subset$tall)
First look at the row that has to do with New York. Get that total. Then, look at the column that has to do with "TRUE".
$$ \begin{align} P(Tall|New York) &= \frac{\text{# Tall and New York}} {\text{# New York}} \\ &= 14 / (8+14) \\ &= 14 / 22 \approx 0.363 \end{align} $$This set up implies that a disease is rare, but that a very accurate test exists (i.e., equally high sensitivity and specificity).
$n = 10,000$
$P(D) = 0.05$
$P(+|D) = 0.95$
$P(-|D') = 0.95$
` | $D$ | $D'$ | Total |
---|---|---|---|
+ | a | c | a+c |
- | b | d | b+d |
Total | a+b | c+d | a+b+c+d |
We now know $a+b+c+d=10000$ and $a+b=10000 \cdot P(D)$. Now, let's play Sudoku.
Now that we have this contingency table, how would we begin to calculate the PPV?
` | $D$ | $D'$ | Total |
---|---|---|---|
+ | 475 | 475 | 950 |
- | 25 | 9025 | 9050 |
Total | 500 | 9500 | 10000 |
$PPV = P(D|+) = 475/950 = 0.5$
Re-do the two-way table and re-calculate the PPV, but assume $P(D)=0.02$ this time.
$n = 10,000$
$P(D) = 0.02$
$P(+|D) = 0.95$
$P(-|D') = 0.95$
` | $D$ | $D'$ | Total |
---|---|---|---|
+ | 190 | 490 | 680 |
- | 10 | 9310 | 9320 |
Total | 200 | 9800 | 10000 |
Now, calculate the PPV.
$PPV = P(D|+) = 190/680 = 0.2794118$
Sensitivity is high (0.95), but the PPV (the probability of having the disease given you have a positive test) is low for the first calculation and even lower for the second calculation.
When disease is rare, the chance of not having the disease is very common.
There will be more false positives than true positives, leading to a low PPV.
(Of all the tests that were positive, a high proportion of them will be false positive.)
We will be working with the below data. This table looks quite a bit different than what we were looking at earlier. Each row is associated with a number of subjects that share in common the three categorical variable values.
chd <- read.csv("data/Lab5_CHD.csv")
chd
The probability of disease given that someone is young and does smoke. First, we calculate the total amount of subjects that are young and smoke. Next, we calculate amount of those who have CHD.
chd %>% filter(Age=="young") %>% filter(Smoking=="yes")
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="yes") %>% pull(n))
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="yes") %>% filter(CHD=="yes") %>% pull(n))
60/300
The probability of disease given that someone is young and does not smoke.
chd %>% filter(Age=="young") %>% filter(Smoking=="no")
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="no") %>% pull(n))
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="no") %>% filter(CHD=="yes") %>% pull(n))
105/700
On your own, calculate these values.
$P(D|A',S) = 0.2$ is greater than $P(D|A',S') = 0.15$.
$P(D|A,S) = 0.6$ is greater than $P(D|A,S′) = 0.3$.
$P(D) = \frac {\text{# with disease}} {\text{# of subjects total}} = \frac{60+105+180+210}{N} = 0.2775$
sum(chd %>% filter(CHD=="yes") %>% pull(n)) / sum(chd %>% pull(n))
# * DON'T WORRY ABOUT THIS ENTIRE CHUNK
rappers <- read_csv("../../notes/rappers/data/rappers.csv")
# * CONVERTING HEIGHT INTO FEET
height_to_decimal <- function(this_height) {
numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "\'"), as.numeric)
numeric[1] + (numeric[2] / 12)
}
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")
# * USING MUTATE TO ADD USEFUL VALUES
rappers <- rappers %>% mutate(height_decimal=height_to_decimal(height))
rappers <- rappers %>% mutate(height_decimal=as.numeric(height_decimal))
rappers <- rappers %>% mutate(tall=height_decimal > mean(height_decimal))
rappers <- rappers %>% mutate(age=2019-birth_year,
active=2019-start_year)