Lab 4. Probably about probability.

Maybe

Goals

(1) Calculate conditional probabilities for tables for two categorical variables
(2) And for three categorical variables

In [7]:
library(dplyr)
library(tidyverse)

Conditional probability for 2 variables

When you have 2 categorical variables, you can summarize their counts in a contingency table.

` Category 2A Category 2B Total
Category 1A a c a+c
Category 1B b d b+d
Total a+b c+d a+b+c+d

I want to give you a chance to think about stats that's for a quick second not in the context of public health. How about we look at famous rappers? I have already read in a dataset.

In [9]:
rappers_subset <- rappers %>% filter(origin %in% c("New-York", "California"))
head(rappers_subset)
artist_namelegal_namebirth_yearoriginnet_worthstart_yeardeceasedheightheight_decimaltallageactive
Nicki-Minaj Onika-Maraj 1982 New-York 75 2004 FALSE 5'2" 5.166667 FALSE 37 15
Jay-Z Shawn-Carter 1969 New-York 900 1986 FALSE 6'2" 6.166667 TRUE 50 33
Kendrick-Lamar Kendrick-Duckworth1987 California 45 2003 FALSE 5'6" 5.500000 FALSE 32 16
E-40 Earl-Stevens 1967 California 10 1986 FALSE 6'1" 6.083333 TRUE 52 33
Nas Nasir-Jones 1973 New-York 50 1991 FALSE 5'8" 5.666667 FALSE 46 28
Jadakiss Jason-Phillips 1975 New-York 6 1991 FALSE 5'7" 5.583333 FALSE 44 28

While not all of the rappers in this dataset are from California or New York, a lot them were, so I decided to take a look at just California and New York rappers who are "tall" (above mean height) or not.

In [10]:
table(rappers_subset$origin, rappers_subset$tall)
            
             FALSE TRUE
  California     9    7
  New-York       8   14

Let's calculate some conditional probabilities. What is the probability that you're tall given that you're a famous rapper from New York?

In [11]:
table(rappers_subset$origin, rappers_subset$tall)
            
             FALSE TRUE
  California     9    7
  New-York       8   14

First look at the row that has to do with New York. Get that total. Then, look at the column that has to do with "TRUE".

$$ \begin{align} P(Tall|New York) &= \frac{\text{# Tall and New York}} {\text{# New York}} \\ &= 14 / (8+14) \\ &= 14 / 22 \approx 0.363 \end{align} $$

Conditional probability for 2 variables

Vocabulary

  • Sensitivity: $P(+|D)$ or the probability of test being positive if one has the disease.
  • Specificity: $P(-|D')$ or the probability of test being negative given one does not have the disease.
  • Positive predictive value (PPV): $P(D|+)$ or the probability of having the disease if an individual tests positive.
  • Negative predictive value (NPV): $P(D'|-)$ or the probability of not having the disease if an individual tests negative.

This set up implies that a disease is rare, but that a very accurate test exists (i.e., equally high sensitivity and specificity).

$n = 10,000$
$P(D) = 0.05$
$P(+|D) = 0.95$
$P(-|D') = 0.95$

` $D$ $D'$ Total
+ a c a+c
- b d b+d
Total a+b c+d a+b+c+d

We now know $a+b+c+d=10000$ and $a+b=10000 \cdot P(D)$. Now, let's play Sudoku.

Now that we have this contingency table, how would we begin to calculate the PPV?

` $D$ $D'$ Total
+ 475 475 950
- 25 9025 9050
Total 500 9500 10000

$PPV = P(D|+) = 475/950 = 0.5$

Re-do the two-way table and re-calculate the PPV, but assume $P(D)=0.02$ this time.

$n = 10,000$
$P(D) = 0.02$
$P(+|D) = 0.95$
$P(-|D') = 0.95$

` $D$ $D'$ Total
+ 190 490 680
- 10 9310 9320
Total 200 9800 10000

Now, calculate the PPV.

$PPV = P(D|+) = 190/680 = 0.2794118$

Sensitivity is high (0.95), but the PPV (the probability of having the disease given you have a positive test) is low for the first calculation and even lower for the second calculation.

When disease is rare, the chance of not having the disease is very common.

There will be more false positives than true positives, leading to a low PPV.
(Of all the tests that were positive, a high proportion of them will be false positive.)

Conditional probabilities for 3 (or more) variables

We will be working with the below data. This table looks quite a bit different than what we were looking at earlier. Each row is associated with a number of subjects that share in common the three categorical variable values.

In [12]:
chd <- read.csv("data/Lab5_CHD.csv")
chd
AgeSmokingCHDn
youngyes yes 60
youngyes no 240
youngno yes 105
youngno no 595
old yes yes 180
old yes no 120
old no yes 210
old no no 490

Calculating $P(D|A′,S)$

The probability of disease given that someone is young and does smoke. First, we calculate the total amount of subjects that are young and smoke. Next, we calculate amount of those who have CHD.

In [13]:
chd %>% filter(Age=="young") %>% filter(Smoking=="yes")
AgeSmokingCHDn
youngyes yes 60
youngyes no 240
In [14]:
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="yes") %>% pull(n))
300
In [15]:
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="yes") %>% filter(CHD=="yes") %>% pull(n))
60
In [16]:
60/300
0.2

Calculating $P(D|A′,S′)$

The probability of disease given that someone is young and does not smoke.

In [17]:
chd %>% filter(Age=="young") %>% filter(Smoking=="no")
AgeSmokingCHDn
youngno yes 105
youngno no 595
In [18]:
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="no") %>% pull(n))
700
In [19]:
sum(chd %>% filter(Age=="young") %>% filter(Smoking=="no") %>% filter(CHD=="yes") %>% pull(n))
105
In [20]:
105/700
0.15

On your own, calculate these values.

  • $P(D|A,S)$ and $P(D|A,S′)$. What can we conclude based on the calculated conditional probabilities?
  • $P(D)$, the marginal probability of disease.

$P(D|A',S) = 0.2$ is greater than $P(D|A',S') = 0.15$.
$P(D|A,S) = 0.6$ is greater than $P(D|A,S′) = 0.3$.

$P(D) = \frac {\text{# with disease}} {\text{# of subjects total}} = \frac{60+105+180+210}{N} = 0.2775$

In [21]:
sum(chd %>% filter(CHD=="yes") %>% pull(n)) / sum(chd %>% pull(n))
0.2775

This chunk should be run right after the libraries are read in . I put this chunk down here to keep the lab focusing on concepts instead of coding.

In [8]:
# * DON'T WORRY ABOUT THIS ENTIRE CHUNK
rappers <- read_csv("../../notes/rappers/data/rappers.csv")

# * CONVERTING HEIGHT INTO FEET
height_to_decimal <- function(this_height) {
    numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "\'"), as.numeric)
    numeric[1] + (numeric[2] / 12)
}
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")

# * USING MUTATE TO ADD USEFUL VALUES
rappers <- rappers %>% mutate(height_decimal=height_to_decimal(height))
rappers <- rappers %>% mutate(height_decimal=as.numeric(height_decimal))
rappers <- rappers %>% mutate(tall=height_decimal > mean(height_decimal))

rappers <- rappers %>% mutate(age=2019-birth_year,
                              active=2019-start_year)
Parsed with column specification:
cols(
  artist_name = col_character(),
  legal_name = col_character(),
  birth_year = col_integer(),
  origin = col_character(),
  net_worth = col_double(),
  start_year = col_integer(),
  deceased = col_logical(),
  height = col_character()
)