The dataset we will be using is information about rappers collected from Google searches. To be clear, these data were chosen with a lot of bias. I'm missing plenty of hip-hop history, and I am a big fan of rap, so I have my favorites, who are probably represented in the dataset. The first 30 or so artists (too lazy to figure out the exact number at this point) were chosen off the top of my head. The rest were chosen from lists found under the search queries "best rappers" and "most influential rappers". Net worth and height were recorded from the Google hint that appeared after searching each artist's name followed by "net worth" or "height". The rest of the information was taken from their Wikipedia pages. If no origin/hometown was specified, their place of birth was taken as their origin state.
The dataset is called rappers.csv and is downloadable for you to do your own practice on! Let's load in the data and take a look.
library(readr)
rappers <- read_csv("../data/rappers.csv")
The columns within rappers represent values for the following variables.
names(rappers)
We can try to summarize and visualize our dataset.
We're going to use the library dplyr to calculate some interesting summaries. Take a look at the following chunks and see what they're doing.
library(dplyr)
Oftentimes, data will come to us in a format that isn't easy to work with in R, like a height written as 5'2" instead of 5.1667 feet. Our next goal will be to convert the height column into decimal format. To do this we will use mutate() to apply a function called height_to_decimal. Don't worry about coding the following function. Just take it as a given.
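# * DON'T WORRY ABOUT THIS CHUNK :D
# Converts one height such as 5'2" into decimal feet (5 + 2/12).
height_to_decimal <- function(this_height) {
  # Drop the inch mark, then split on the foot mark to get feet and inches
  numeric <- sapply(strsplit(gsub("\"", "", as.character(this_height)), "\'"), as.numeric)
  numeric[1] + (numeric[2] / 12)
}
# Vectorize() lets the function run on a whole column, one height at a time
height_to_decimal <- Vectorize(height_to_decimal, vectorize.args="this_height")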
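If you are curious what the conversion looks like on a few values, here is a small standalone sketch. It is a hypothetical alternative called height_to_decimal_alt, written with vapply() instead of Vectorize(), and the sample heights are made up for illustration. It is not the function used in the rest of this walkthrough.

```r
# Hypothetical alternative helper (an assumption, not the function used above):
# turns strings like 5'2" into decimal feet.
height_to_decimal_alt <- function(heights) {
  cleaned <- gsub("\"", "", as.character(heights))  # drop the inch mark
  parts <- strsplit(cleaned, "'", fixed = TRUE)     # split feet from inches
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 12                                # feet + inches / 12
  }, numeric(1))
}

# Made-up heights, for illustration only
sample_heights <- c("5'2\"", "6'0\"", "5'7\"")
converted <- height_to_decimal_alt(sample_heights)
converted
```

For 5'2", this works out to 5 + 2/12, or about 5.1667 feet, which matches the example above.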
Recall, we can use mutate() to take an existing column and create a new one based on its values. Take a look at the new column we tacked on to the far right side. Also notice that I overwrote my previous dataframe with the updated version that contains the height_decimal column using the assignment operator.
rappers <- rappers %>% mutate(height_decimal=height_to_decimal(height))
rappers <- rappers %>% mutate(height_decimal=as.numeric(height_decimal))
head(rappers)
The last example can be a little confusing if you look at the chunk I told you not to look at. Haha. To reiterate how to use mutate(), we're now going to calculate each artist's age (rounded to the start of the year) and each artist's number of active years. (For now, we will not worry about whether or not a rapper is still around on this Earth.)
rappers <- rappers %>% mutate(age=2019-birth_year,
                              active=2019-start_year)
head(rappers)
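If it helps to see mutate() on something tiny, here is a minimal sketch on a made-up table. The artists, years, and the name reference_year are all assumptions for illustration; none of the rows come from rappers.csv.

```r
library(dplyr)

# Made-up rows for illustration only; these are not from rappers.csv
toy <- tibble(
  artist_name = c("Artist A", "Artist B"),
  birth_year  = c(1975, 1990),
  start_year  = c(1994, 2010)
)

reference_year <- 2019  # same year used in the chunk above

# Each new column is computed row by row from the existing columns
toy <- toy %>%
  mutate(age    = reference_year - birth_year,
         active = reference_year - start_year)

toy
```

Keeping the year in one variable means it only has to be changed in one place if the data get updated.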
Now, we're going to make summaries and data visualizations using dplyr and ggplot2. Recall that we already loaded dplyr earlier, so we're not going to do it again. We're going to use the two libraries one after the other to show how we can calculate (code) and view our data summaries.
library(ggplot2)
Birth year is discrete, numeric data.
rappers %>% summarize(mean_birth = mean(birth_year),
                      median_birth = median(birth_year),
                      early_birth = min(birth_year),
                      late_birth = max(birth_year))
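One thing to watch for with summaries like this: if a column contains missing values, mean() and friends return NA unless told otherwise. Here is a minimal sketch on made-up birth years (not from rappers.csv), assuming we would want to skip the missing entries with na.rm = TRUE.

```r
library(dplyr)

# Made-up birth years with one missing value, for illustration only
toy <- tibble(birth_year = c(1964, 1980, NA, 1999))

# na.rm = TRUE drops the NA before each summary is calculated
toy_summary <- toy %>%
  summarize(mean_birth   = mean(birth_year, na.rm = TRUE),
            median_birth = median(birth_year, na.rm = TRUE),
            early_birth  = min(birth_year, na.rm = TRUE),
            late_birth   = max(birth_year, na.rm = TRUE))

toy_summary
```

Whether dropping missing values is the right call depends on why they are missing, so treat this as one option rather than a rule.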
Here's a boxplot of birth year. The box itself spans the interquartile range (IQR).
ggplot(rappers, aes(y=birth_year)) +
  geom_boxplot() +
  geom_hline(yintercept=1964) +
  geom_hline(yintercept=1999) +
  geom_hline(yintercept=1980.787, color="gold") +
  geom_hline(yintercept=1980, color="royalblue") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
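The edges of the box in a boxplot are the first and third quartiles, and the IQR is the distance between them. Here is a minimal sketch of that calculation in base R on made-up years (not from rappers.csv), using quantile() and IQR() with their default settings.

```r
# Made-up birth years for illustration only; not from rappers.csv
years <- c(1964, 1971, 1975, 1980, 1985, 1990, 1999)

quartiles <- quantile(years, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
box_width <- IQR(years)                                   # Q3 minus Q1

quartiles
box_width
```

The middle value of quartiles is the median, which is the line drawn inside the box.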
ggplot(rappers, aes(x=birth_year)) +
  geom_bar() +
  geom_vline(xintercept=1980.787, color="gold") +
  geom_vline(xintercept=1980, color="royalblue")
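ggplot(rappers, aes(x=birth_year)) +
  geom_histogram(binwidth=2, col="white", lwd=0.25) +
  # Mean (gold) and median (blue) are computed from the data rather than hard-coded
  geom_vline(xintercept=mean(rappers$birth_year), color="gold") +
  geom_vline(xintercept=median(rappers$birth_year), color="royalblue")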
Too $hort is one of the rappers in our dataset. Here's a little bit of information we have about him from our dataset.
rappers %>% filter(artist_name=="Too-$hort")
Is Too $hort too short? Let's compare him to the rest of our rappers. From the above, we know he is 5'7". By the summary below, Too $hort does happen to be shorter than the first quartile of heights.
summary <- rappers %>% summarize(mean_height = mean(height_decimal),
                                 median_height = median(height_decimal),
                                 short = quantile(height_decimal, 0.25),
                                 tall = quantile(height_decimal, 0.75))
summary
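The comparison against the first quartile can also be done in code instead of by eye. Here is a minimal sketch on made-up heights (not from rappers.csv); the names one_height, below_q1, and share_shorter are just for illustration.

```r
# Made-up heights in decimal feet, for illustration only
heights <- c(5.5, 5.75, 5.8333, 6, 6.0833, 6.25, 6.3333)
one_height <- 5 + 7 / 12   # 5'7" in decimal feet

# First quartile of the made-up heights (unname() drops the "25%" label)
first_quartile <- unname(quantile(heights, 0.25))

below_q1 <- one_height < first_quartile        # TRUE if shorter than Q1
share_shorter <- mean(heights < one_height)    # fraction of heights below 5'7"

below_q1
share_shorter
```

With the real data, the same comparison would use the height_decimal column in place of the made-up vector.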
Rapper height is a continuous variable, so we're going to use a histogram to visualize it.
plottable_mean <- summary %>% pull(mean_height)
plottable_median <- summary %>% pull(median_height)
plottable_short <- summary %>% pull(short)
plottable_tall <- summary %>% pull(tall)
todd_shaw <- rappers %>% filter(artist_name=="Too-$hort") %>% pull(height_decimal)
Too $hort is the solid red line. The dashed lines mark the summary values computed above: the mean (green), the median (gold), the first quartile (orange), and the third quartile (blue).
ggplot(rappers, aes(x=height_decimal)) +
  geom_histogram(binwidth=1/12, col="white", lwd=0.25) +
  geom_vline(xintercept=plottable_mean, color="forestgreen", lty=2) +
  geom_vline(xintercept=plottable_median, color="gold", lty=2) +
  geom_vline(xintercept=todd_shaw, color="red") +
  geom_vline(xintercept=plottable_short, color="orange", lty=2) +
  geom_vline(xintercept=plottable_tall, color="royalblue", lty=2)
Rapper origin is a categorical variable (nominal).
ggplot(rappers, aes(x=origin)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
origin_summary <- rappers %>% group_by(origin) %>%
  summarize(mean_age=mean(age),
            mean_net=mean(net_worth))
origin_summary
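group_by() followed by summarize() gives one row per group instead of one row for the whole table. Here is a minimal sketch on made-up rows (not from rappers.csv); the extra n_artists column is my own addition to show how many rows went into each average.

```r
library(dplyr)

# Made-up rows for illustration only; net_worth units are arbitrary here
toy <- tibble(
  origin    = c("CA", "CA", "NY", "NY", "GA"),
  age       = c(50, 40, 45, 35, 30),
  net_worth = c(10, 20, 30, 50, 5)
)

# One output row per origin, with the averages taken within each group
toy_by_origin <- toy %>%
  group_by(origin) %>%
  summarize(mean_age  = mean(age),
            mean_net  = mean(net_worth),
            n_artists = n())

toy_by_origin
```

A group count like this is handy when reading the averages, since a mean based on one or two artists can look extreme.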
ggplot(origin_summary, aes(x=origin, y=mean_net, fill=mean_age)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(origin_summary, aes(x=origin, y=mean_age, fill=mean_net)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))