# * LIBRARIES
library(dplyr)
library(ggplot2)
library(broom)
# * LOAD IN DATA
rappers <- read.csv("../data/rappers.csv")
This chunk adds some useful columns. Don't worry about being able to write a function like this yourself.
# * DON'T WORRY ABOUT THIS PART :D
# Convert a height written like 5'11" into decimal feet (5 + 11/12)
height_to_decimal <- function(this_height) {
  numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "\'"), as.numeric)
  numeric[1] + (numeric[2] / 12)
}
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")
# * WORRY ABOUT THIS PART
# * USING MUTATE TO ADD USEFUL VALUES
rappers <- rappers %>% mutate(height_decimal=as.numeric(height_to_decimal(height)))
rappers <- rappers %>% mutate(age=2019-birth_year,
                              active=2019-start_year)
head(rappers)
Here's where this topic really starts! In this section, we're going to cover how to work with categorical data. Again, categorical data are data that describe characteristics rather than quantities. (As a self-check: INTERVALS of quantities are also categorical! Review this if it seems fishy to you.)
I'm going to make a contingency table (also called a two-way table) based on the following subset of just California and New York rappers.
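As a quick illustration of the self-check about intervals, here is a minimal sketch of a quantity becoming categorical once it is binned. The ages and the break points are made up for illustration; they are not taken from the rappers data.

```r
# Hypothetical ages (made up; not from rappers.csv)
ages <- c(22, 27, 31, 38, 45, 52)

# cut() bins a numeric vector into intervals and returns a factor
age_group <- cut(ages, breaks = c(20, 30, 40, 60))

# Each interval is now a category that we can count
table(age_group)
```

Once the ages are a factor, they can go into a contingency table like any other categorical variable.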
library(dplyr)
cat_subset <- rappers %>% filter(origin %in% c("New-York", "California"))
cat_subset <- cat_subset %>% mutate(tall=height_decimal > mean(height_decimal))
cat_subset <- cat_subset %>% mutate(vet=active > mean(active))
cat_subset <- cat_subset %>% select(artist_name, origin, tall, vet)
head(cat_subset)
Here's a two-way table of a rapper's origin against whether or not they are a rap vet.
contingency_table <- table(as.character(cat_subset$origin), cat_subset$vet)
contingency_table
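If row and column totals would help when reading a two-way table, base R's addmargins() appends them. This is a minimal sketch on a small made-up data frame; toy_subset and its values are hypothetical stand-ins for cat_subset.

```r
# Hypothetical stand-in for cat_subset (values are made up)
toy_subset <- data.frame(
  origin = c("California", "California", "California",
             "New-York", "New-York", "New-York", "New-York"),
  vet    = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Two-way table of counts: rows are origins, columns are vet status
toy_table <- table(origin = toy_subset$origin, vet = toy_subset$vet)

# Append a "Sum" row and a "Sum" column holding the marginal totals
toy_with_totals <- addmargins(toy_table)
toy_with_totals
```

The totals are the denominators you reach for when computing conditional probabilities by hand.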
We can visualize categorical variables by making a dodged bar plot. Below is one that separates California and New York rappers, then further separates rappers who have been in the rap game for more years than average from those who are newer to the game.
ggplot(cat_subset, aes(x=origin, fill=vet)) +
  geom_bar(position="dodge") +
  ggtitle("Counts of Rap Veterans in CA and NY")
The plot below takes the visualization a step further. It may not be the easiest to read, but it now also shows counts for whether or not a rapper is tall. We use facet_wrap() to separate the two states as well. (That separation is true geographically and, I would definitely suppose, historically as far as competitive nature goes.)
ggplot(cat_subset, aes(x=tall, fill=vet)) +
  geom_bar(position="dodge") +
  ggtitle("How Rapper Height, 'Veteran' Status, and Origin Relate") +
  facet_wrap(~origin)
What is the conditional probability that a rapper is from California given that the rapper is a veteran of rap? (If you haven't read any of the above, a veteran is defined as a rapper who has been active longer than the mean amount of active time.)
To answer this question, let's look at our contingency table.
contingency_table
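One way to read this conditional probability off the table: condition on the veterans by looking only at the TRUE column, then divide the California cell by that column's total. prop.table() with margin = 2 does this division for every column at once. The counts below are made up for illustration, so substitute the real contingency_table to get the actual answer.

```r
# Hypothetical counts (made up; not the real rappers data)
toy_table <- as.table(matrix(
  c(6, 9,    # vet == FALSE: California, New-York
    4, 6),   # vet == TRUE:  California, New-York
  nrow = 2,
  dimnames = list(origin = c("California", "New-York"),
                  vet = c("FALSE", "TRUE"))
))

# By hand: California veterans divided by all veterans
p_ca_given_vet <- toy_table["California", "TRUE"] / sum(toy_table[, "TRUE"])
p_ca_given_vet

# Same division applied to every cell: each column now sums to 1
col_props <- prop.table(toy_table, margin = 2)
col_props["California", "TRUE"]
```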
What is the conditional probability of being tall given that a rapper is from New York? We can calculate this similarly, but we need to make a contingency table for that context.
contingency_table <- table(as.character(cat_subset$origin), cat_subset$tall)
contingency_table
Now that we have this contingency table, we look only at the New York row, since being from New York is the condition. The "TRUE" column holds the rappers who are truly tall, so we divide the New York cell in that column by the New York row total. Our probability ends up being:
$P(Tall|NY) = 11 / 22 = 1/2$
And since we calculated this value, we can use the complement rule to figure out $P(Short|NY) = 1 - P(Tall|NY) = 1/2$.
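The same calculation can be sketched with row proportions, since this time the condition (New York) is a row. The New York row below uses the 11 and 22 quoted in the formula; the California row is made up, so treat toy_table as a hypothetical stand-in for the real contingency_table.

```r
# New-York row: 11 tall out of 22 (from the text); California row is made up
toy_table <- as.table(matrix(
  c(10, 11,   # tall == FALSE: California (made up), New-York
     8, 11),  # tall == TRUE:  California (made up), New-York
  nrow = 2,
  dimnames = list(origin = c("California", "New-York"),
                  tall = c("FALSE", "TRUE"))
))

# margin = 1 divides each cell by its row total, so each row sums to 1
row_props <- prop.table(toy_table, margin = 1)

p_tall_given_ny  <- row_props["New-York", "TRUE"]
p_short_given_ny <- 1 - p_tall_given_ny   # complement rule
c(tall = p_tall_given_ny, short = p_short_given_ny)
```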
In order to calculate the probability given two pieces of information (tall and vet), we need to take a look at counts. Because we're working with three categorical variables, we can't look at just one contingency table anymore. The following is an extension of the contingency tables we were looking at before.
counts_table <- data.frame(table(origin=as.character(cat_subset$origin), vet=cat_subset$vet, tall=cat_subset$tall))
names(counts_table)[4] <- "n"
counts_table <- counts_table %>% arrange(origin)
counts_table
The rightmost column, n, counts how many rappers share each combination of values. Take a minute to inspect the table. It seems that there are not a lot of Californian non-veterans who are not tall, and there are a lot of New York veterans who are tall.
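To get $P(CA|Tall,Vet)$, we can now think in terms of the tall and vet columns: keep only the rows where both are TRUE. Among those rows, the California count divided by the sum of n is the probability we want.
counts_table %>% filter(vet==TRUE, tall==TRUE)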
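Here is a minimal sketch of finishing that calculation: keep the rows that satisfy the condition (tall and vet), then divide the California count by the total of the remaining counts. The counts in toy_counts are made up, and its vet and tall columns are plain logicals, whereas in the real counts_table they are factors because the table came from table().

```r
# Hypothetical counts (made up; not the real rappers data)
toy_counts <- data.frame(
  origin = rep(c("California", "New-York"), each = 4),
  vet    = rep(c(FALSE, FALSE, TRUE, TRUE), times = 2),
  tall   = rep(c(FALSE, TRUE), times = 4),
  n      = c(3, 5, 4, 6, 7, 4, 5, 9)
)

# Condition on the given information: keep only tall veterans
tall_vets <- toy_counts[toy_counts$vet & toy_counts$tall, ]

# P(CA | Tall, Vet) = California tall vets / all tall vets
p_ca_given_tall_vet <-
  tall_vets$n[tall_vets$origin == "California"] / sum(tall_vets$n)
p_ca_given_tall_vet
```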
We won't always have the counts of our variables of interest, for whatever reason. Sometimes we will just have probabilities. We're about to switch gears, so I'm going to make up some values.
Let $G$ be the event that a given rapper has won a Grammy within the last five years. Let $P(G)=0.04$, since that is pretty darn rare. Remember that our dataset has $n=75$ subjects (rappers). Now, let $P(Vet|G)=2/3$ and $P(Vet'|G')=11/12$.
Say we are interested in calculating the probability of getting a Grammy given that a rapper is not a rap vet, $P(G|Vet')$. We can start thinking in terms of a contingency table, then use the information above to fill it out!
| | $Vet$ | $Vet'$ | Total |
|---|---|---|---|
| $G$ | a | d | g |
| $G'$ | b | e | h |
| Total | c | f | $N=75$ |
To get $g$, we can use $g=75 \cdot P(G)=75(0.04)=3$.
| | $Vet$ | $Vet'$ | Total |
|---|---|---|---|
| $G$ | a | d | 3 |
| $G'$ | b | e | h |
| Total | c | f | $N=75$ |
Then to get $h$, we subtract $g$ from 75. Thus, $h=72$.
| | $Vet$ | $Vet'$ | Total |
|---|---|---|---|
| $G$ | a | d | 3 |
| $G'$ | b | e | 72 |
| Total | c | f | $N=75$ |
Recall that we have $P(Vet'|G') = 11/12$ and $P(Vet|G) = 2/3$. Multiplying each by its row total gives the counts $a$ and $e$.
| | $Vet$ | $Vet'$ | Total |
|---|---|---|---|
| $G$ | a=3(2/3)=2 | d | 3 |
| $G'$ | b | e=72(11/12)=66 | 72 |
| Total | c | f | $N=75$ |
Finally, we can use subtraction and addition to work out the rest.
| | $Vet$ | $Vet'$ | Total |
|---|---|---|---|
| $G$ | a=3(2/3)=2 | d=3-2=1 | 3 |
| $G'$ | b=72-66=6 | e=72(11/12)=66 | 72 |
| Total | c=2+6=8 | f=1+66=67 | $N=75$ |
Now, you have a table that you can use to write out more conditional and marginal frequencies. Have a hack at it!
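The filling-in steps can also be sketched in R, ending with the probability this example set out to find, $P(G|Vet')$. The variable names are my own, and round() is only there to guard against floating-point noise since the cells are counts.

```r
n_total <- 75

# Row totals from P(G) = 0.04
g_total    <- round(n_total * 0.04)         # g = 3
notg_total <- n_total - g_total             # h = 72

# Cells from the two given conditional probabilities
g_vet       <- round(g_total * 2 / 3)       # a = 2
notg_notvet <- round(notg_total * 11 / 12)  # e = 66

# Remaining cells by subtraction
g_notvet <- g_total - g_vet                 # d = 1
notg_vet <- notg_total - notg_notvet        # b = 6

# matrix() fills column by column: the Vet column first, then Vet'
grammy_table <- as.table(matrix(
  c(g_vet, notg_vet, g_notvet, notg_notvet),
  nrow = 2,
  dimnames = list(c("G", "G'"), c("Vet", "Vet'"))
))
grammy_totals <- addmargins(grammy_table)
grammy_totals

# P(G | Vet') = d / f: Grammy winners among the non-vets
p_g_given_notvet <- g_notvet / (g_notvet + notg_notvet)
p_g_given_notvet
```

With these made-up inputs, $P(G|Vet') = 1/67$, which is about 0.015.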