Lab 4. Probably about probability.¶

Maybe

Goals¶

(1) Calculate conditional probabilities for tables for two categorical variables
(2) And for three categorical variables

library(dplyr)
library(tidyverse)

Conditional probability for 2 variables¶

When you have 2 categorical variables, you can summarize their counts in a contingency table.

`	Category 2A	Category 2B	Total
Category 1A	a	c	a+c
Category 1B	b	d	b+d
Total	a+b	c+d	a+b+c+d

I want to give you a chance to think about stats that's for a quick second not in the context of public health. How about we look at famous rappers? I have already read in a dataset.

rappers_subset <- rappers %>% filter(origin %in% c("New-York", "California"))
head(rappers_subset)

While not all of the rappers in this dataset are from California or New York, a lot them were, so I decided to take a look at just California and New York rappers who are "tall" (above mean height) or not.

table(rappers_subset$origin, rappers_subset$tall)

            
             FALSE TRUE
  California     9    7
  New-York       8   14

Let's calculate some conditional probabilities. What is the probability that you're tall given that you're a famous rapper from New York?

table(rappers_subset$origin, rappers_subset$tall)

            
             FALSE TRUE
  California     9    7
  New-York       8   14

First look at the row that has to do with New York. Get that total. Then, look at the column that has to do with "TRUE".

$$ \begin{align} P(Tall|New York) &= \frac{\text{# Tall and New York}} {\text{# New York}} \\ &= 14 / (8+14) \\ &= 14 / 22 \approx 0.363 \end{align} $$

Conditional probability for 2 variables¶

Vocabulary¶

Sensitivity: $P(+|D)$ or the probability of test being positive if one has the disease.
Specificity: $P(-|D')$ or the probability of test being negative given one does not have the disease.
Positive predictive value (PPV): $P(D|+)$ or the probability of having the disease if an individual tests positive.
Negative predictive value (NPV): $P(D'|-)$ or the probability of not having the disease if an individual tests negative.

This set up implies that a disease is rare, but that a very accurate test exists (i.e., equally high sensitivity and specificity).

$n = 10,000$
$P(D) = 0.05$
$P(+|D) = 0.95$
$P(-|D') = 0.95$

`	$D$	$D'$	Total
+	a	c	a+c
-	b	d	b+d
Total	a+b	c+d	a+b+c+d

We now know $a+b+c+d=10000$ and $a+b=10000 \cdot P(D)$. Now, let's play Sudoku.

Now that we have this contingency table, how would we begin to calculate the PPV?

`	$D$	$D'$	Total
+	475	475	950
-	25	9025	9050
Total	500	9500	10000

$PPV = P(D|+) = 475/950 = 0.5$

Re-do the two-way table and re-calculate the PPV, but assume $P(D)=0.02$ this time.

$n = 10,000$
$P(D) = 0.02$
$P(+|D) = 0.95$
$P(-|D') = 0.95$

`	$D$	$D'$	Total
+	190	490	680
-	10	9310	9320
Total	200	9800	10000

Now, calculate the PPV.

$PPV = P(D|+) = 190/680 = 0.2794118$

Sensitivity is high (0.95), but the PPV (the probability of having the disease given you have a positive test) is low for the first calculation and even lower for the second calculation.

When disease is rare, the chance of not having the disease is very common.

There will be more false positives than true positives, leading to a low PPV.
(Of all the tests that were positive, a high proportion of them will be false positive.)

Conditional probabilities for 3 (or more) variables¶

We will be working with the below data. This table looks quite a bit different than what we were looking at earlier. Each row is associated with a number of subjects that share in common the three categorical variable values.

chd <- read.csv("data/Lab5_CHD.csv")
chd

Calculating $P(D|A′,S)$¶

The probability of disease given that someone is young and does smoke. First, we calculate the total amount of subjects that are young and smoke. Next, we calculate amount of those who have CHD.

chd %>% filter(Age=="young") %>% filter(Smoking=="yes")

sum(chd %>% filter(Age=="young") %>% filter(Smoking=="yes") %>% pull(n))

sum(chd %>% filter(Age=="young") %>% filter(Smoking=="yes") %>% filter(CHD=="yes") %>% pull(n))

60/300

Calculating $P(D|A′,S′)$¶

The probability of disease given that someone is young and does not smoke.

chd %>% filter(Age=="young") %>% filter(Smoking=="no")

sum(chd %>% filter(Age=="young") %>% filter(Smoking=="no") %>% pull(n))

sum(chd %>% filter(Age=="young") %>% filter(Smoking=="no") %>% filter(CHD=="yes") %>% pull(n))

105/700

On your own, calculate these values.

$P(D|A,S)$ and $P(D|A,S′)$. What can we conclude based on the calculated conditional probabilities?
$P(D)$, the marginal probability of disease.

$P(D|A',S) = 0.2$ is greater than $P(D|A',S') = 0.15$.
$P(D|A,S) = 0.6$ is greater than $P(D|A,S′) = 0.3$.

$P(D) = \frac {\text{# with disease}} {\text{# of subjects total}} = \frac{60+105+180+210}{N} = 0.2775$

sum(chd %>% filter(CHD=="yes") %>% pull(n)) / sum(chd %>% pull(n))

This chunk should be run right after the libraries are read in . I put this chunk down here to keep the lab focusing on concepts instead of coding.¶

# * DON'T WORRY ABOUT THIS ENTIRE CHUNK
rappers <- read_csv("../../notes/rappers/data/rappers.csv")

# * CONVERTING HEIGHT INTO FEET
height_to_decimal <- function(this_height) {
    numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "\'"), as.numeric)
    numeric[1] + (numeric[2] / 12)
}
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")

# * USING MUTATE TO ADD USEFUL VALUES
rappers <- rappers %>% mutate(height_decimal=height_to_decimal(height))
rappers <- rappers %>% mutate(height_decimal=as.numeric(height_decimal))
rappers <- rappers %>% mutate(tall=height_decimal > mean(height_decimal))

rappers <- rappers %>% mutate(age=2019-birth_year,
                              active=2019-start_year)

Parsed with column specification:
cols(
  artist_name = col_character(),
  legal_name = col_character(),
  birth_year = col_integer(),
  origin = col_character(),
  net_worth = col_double(),
  start_year = col_integer(),
  deceased = col_logical(),
  height = col_character()
)

artist_name	legal_name	birth_year	origin	net_worth	start_year	deceased	height	height_decimal	tall	age	active
Nicki-Minaj	Onika-Maraj	1982	New-York	75	2004	FALSE	5'2"	5.166667	FALSE	37	15
Jay-Z	Shawn-Carter	1969	New-York	900	1986	FALSE	6'2"	6.166667	TRUE	50	33
Kendrick-Lamar	Kendrick-Duckworth	1987	California	45	2003	FALSE	5'6"	5.500000	FALSE	32	16
E-40	Earl-Stevens	1967	California	10	1986	FALSE	6'1"	6.083333	TRUE	52	33
Nas	Nasir-Jones	1973	New-York	50	1991	FALSE	5'8"	5.666667	FALSE	46	28
Jadakiss	Jason-Phillips	1975	New-York	6	1991	FALSE	5'7"	5.583333	FALSE	44	28

Age	Smoking	CHD	n
young	yes	yes	60
young	yes	no	240
young	no	yes	105
young	no	no	595
old	yes	yes	180
old	yes	no	120
old	no	yes	210
old	no	no	490