Goals

We’re going to go over three important tools in R by exploring a dataset of rappers. I built this dataset from rappers I enjoy, so keep in mind as we explore together that I’m a Bay Area hip hop fan: the dataset mixes internationally famous rappers with more “local” artists.

  1. Dataset Manipulation with library(dplyr)
  2. Plots with library(ggplot2)
  3. Making a Linear Model

In this exploration, we’re going to start with dataset manipulation. When you explore datasets yourself, you may even want to begin with plots and then manipulate dataframes as necessary. There’s no one right way to explore, but the point is… you should explore!

Part 1: Dataset Manipulation

Let’s load in our library.

library(dplyr)

The Rappers Dataset

We will be using a dataset with information about rappers. Here’s a preview of our dataset.

##      artist_name         legal_name age     origin net_worth start_year
## 1    Nicki Minaj        Onika Maraj  35   New York        75       2004
## 2          Jay Z       Shawn Carter  48   New York       900       1986
## 3         Eminem   Marshall Mathers  45   Missouri       190       1988
## 4 Kendrick Lamar Kendrick Duckworth  31 California        45       2003
## 5          Logic        Robert Hall  28   Maryland        10       2009
## 6           E-40       Earl Stevens  50 California        10       1986

The entire dataset has 28 rows and 6 columns and includes information on the following rappers.

##  [1] Nicki Minaj       Jay Z             Eminem           
##  [4] Kendrick Lamar    Logic             E-40             
##  [7] Nas               Jadakiss          Chance the Rapper
## [10] Childish Gambino  Kanye West        Cardi B          
## [13] G-Eazy            2Pac              YG               
## [16] ScHoolboy Q       Lupe Fiasco       Drake            
## [19] Joey BadA$$       Snoop Dogg        Iamsu!           
## [22] J. Cole           Gucci Mane        Rich Brian       
## [25] Too $hort         Mac Dre           Missy Elliott    
## [28] Notorious B.I.G. 
## 28 Levels: Cardi B Chance the Rapper Childish Gambino E-40 ... YG

In the following code blocks, we’re going to apply several different dplyr functions to the dataset and see what we can find out.

arrange()

Use arrange() to sort based on a column (or columns) in your dataframe.

rapper_df %>% arrange(artist_name)
rapper_df %>% arrange(start_year)
rapper_df %>% arrange(age)

# * SORT BY AGE, THEN BY NET WORTH WITHIN EACH AGE
rapper_df %>% arrange(age, net_worth)
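By default arrange() sorts from smallest to largest. To flip the order, dplyr provides desc(). Here is a small sketch; preview_df is a name I made up, and it holds only the six preview rows shown earlier (not the full rapper_df) so the code can run on its own.

```r
library(dplyr)

# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem",
                  "Kendrick Lamar", "Logic", "E-40"),
  age         = c(35, 48, 45, 31, 28, 50),
  net_worth   = c(75, 900, 190, 45, 10, 10)
)

# * LARGEST NET WORTH FIRST
by_worth <- preview_df %>% arrange(desc(net_worth))

# * NET WORTH ASCENDING, OLDEST FIRST AMONG TIES
tied <- preview_df %>% arrange(net_worth, desc(age))

by_worth
tied
```

In the second call, Logic and E-40 share the same net worth, so desc(age) decides which of the two comes first.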

summarize()

Use summarize() to apply a summary function, such as mean(), to a column in your dataframe. It returns a new dataframe (i.e. not rapper_df) holding the results of your calculations.

rapper_df %>% summarize(mean_age=mean(age),
              mean_net=mean(net_worth))

group_by()

Use group_by() as an intermediate step to tell the next function to calculate its statistics separately for each group of rows that share a common column value.

rapper_df %>% group_by(origin) %>%
  summarize(mean_age=mean(age),
            mean_net=mean(net_worth))
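A group mean can be misleading when a group has only one or two rows, so it can help to count the rows per group with n() alongside the other statistics. This is a sketch on the six preview rows only; preview_df and the column names n_rappers and median_net are my own choices.

```r
library(dplyr)

# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  origin    = c("New York", "New York", "Missouri",
                "California", "Maryland", "California"),
  age       = c(35, 48, 45, 31, 28, 50),
  net_worth = c(75, 900, 190, 45, 10, 10)
)

by_origin <- preview_df %>%
  group_by(origin) %>%
  summarize(n_rappers  = n(),              # rows in each group
            mean_age   = mean(age),
            median_net = median(net_worth))

by_origin
```

The result has one row per origin, so four rows here instead of six.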

select()

Use select() to ask for the columns you want. This returns the same number of rows as the original dataframe and (as long as you don’t ask for all the columns) fewer columns.

rapper_df %>% select(artist_name, origin, net_worth)

mutate()

Use mutate() to add new columns to your dataset based on existing columns. Notice that we’re updating our original dataframe by assigning the dataframe we generated through all the code to the right of the <- to rapper_df. This will make rapper_df have two more columns than it used to.

rapper_df <- rapper_df %>% mutate(years_active=2018-start_year,
                     worth_per_year=net_worth/years_active)
rapper_df
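One detail worth seeing on its own: inside a single mutate() call, a later column can use one created just before it, which is why worth_per_year can refer to years_active. Below is a sketch on two of the preview rows; preview_df and reference_year are names I introduced, with reference_year set to the 2018 used above.

```r
library(dplyr)

# Two of the preview rows from above, just enough to check the arithmetic.
preview_df <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z"),
  net_worth   = c(75, 900),
  start_year  = c(2004, 1986)
)

reference_year <- 2018  # hypothetical variable holding the year used in the text

with_rates <- preview_df %>%
  mutate(years_active   = reference_year - start_year,
         worth_per_year = net_worth / years_active)  # uses the column made one line up

with_rates
```

Keeping the year in a variable means there is only one place to change if you rerun the analysis later.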

rename()

Use rename when you want to change one of your column names. Sometimes, you may want to do this just because you prefer a name over the existing one.

rapper_df %>% rename(just_a_number=age,
                     cash=net_worth)

filter()

Use filter() to create useful subsets or to simply explore your data. This function checks each row against the criteria you specify and returns only the rows that meet them, so the resulting dataframe has the same number of columns and potentially fewer rows.

Our dataframe is pretty small, but imagine we had 500 entries on different rappers: we wouldn’t be able to spot the rows meeting certain criteria just by eye.

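# * ONE CONDITION
rapper_df %>% filter(age>35)

rapper_df %>% filter(origin=="New York")

# * CONDITIONS SEPARATED BY COMMAS MUST ALL BE TRUE
rapper_df %>% filter(age<median(age),
                     net_worth>median(net_worth))

# * OR (|) VERSUS AND (&)
rapper_df %>% filter(age>35 | !(origin=="New York"))
rapper_df %>% filter(age>35 & !(origin=="New York"))

# * CONTRADICTORY CONDITIONS RETURN ZERO ROWS
rapper_df %>% filter(age>35 & age<35)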
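To make the difference between | and & concrete, here is a sketch that counts the rows each version keeps. It runs on the six preview rows only (preview_df is my own name), so the counts will differ from what you get on the full rapper_df.

```r
library(dplyr)

# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem",
                  "Kendrick Lamar", "Logic", "E-40"),
  age         = c(35, 48, 45, 31, 28, 50),
  origin      = c("New York", "New York", "Missouri",
                  "California", "Maryland", "California")
)

either  <- preview_df %>% filter(age > 35 | origin != "New York")  # at least one is true
both    <- preview_df %>% filter(age > 35 & origin != "New York")  # both are true
both_2  <- preview_df %>% filter(age > 35, origin != "New York")   # comma behaves like &
nothing <- preview_df %>% filter(age > 35 & age < 35)              # can never be true

nrow(either)
nrow(both)
nrow(nothing)
```

The | version can only keep the same rows or more than the & version, and the last filter shows that an impossible condition gives back an empty dataframe rather than an error.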

Part 2: Data Visualization

In our next block, we’re going to talk about making plots of our data with ggplot2. Let’s load in the library.

library(ggplot2)

Scatterplots

Scatterplots are how we visualize a potential relationship between two numeric variables. We use geom_point() to throw the points onto our ggplot() canvas.

ggplot(rapper_df, aes(x=age, y=net_worth))

# * BASIC
ggplot(rapper_df, aes(x=age, y=net_worth)) + geom_point()

# * SPECIAL
# rapper_df[-2,] drops row 2 (Jay Z, net worth 900) before plotting
ggplot(rapper_df[-2,], aes(x=age, y=net_worth)) +
  geom_point(col="#00BFC4") +
  xlab("Age") +
  ylab("Net Worth") +
  ggtitle("Rappers' Net Worth versus Age")

subset <- rapper_df %>% filter(origin %in% c("California", "New York"))
ggplot(subset, aes(x=age, y=net_worth)) +
  geom_point(aes(col=origin)) +
  facet_wrap(~origin) +
  xlab("Age") +
  ylab("Net Worth")

Barplots

Barplots are used to visualize categorical variables. Let’s see what we can find when we plot these artists’ origins.

# * BASIC
ggplot(rapper_df, aes(x=origin)) + geom_bar()

# * SPECIAL
ggplot(rapper_df, aes(x=origin)) +
  geom_bar(aes(fill=origin)) +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Instead of counting how many rappers come from each origin, we can also give ggplot2 our own values to use as bar heights. Here, I’m telling ggplot that I want the mean age for each origin plotted as a bar graph. The difference is that we provide y in aes() and stat="identity" in geom_bar().

subset_2 <- rapper_df %>% group_by(origin) %>%
  summarize(mean_age=mean(age))

ggplot(subset_2, aes(x=origin, y=mean_age)) +
  geom_bar(stat="identity", aes(fill=origin)) +
  ylab("mean age") +
  theme(axis.text.x=element_text(angle=45, hjust=1))
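ggplot2 also has geom_col(), which uses the y values as bar heights directly, so it can stand in for geom_bar(stat="identity"). Here is a sketch on the six preview rows; preview_df and mean_ages are names I picked for illustration.

```r
library(dplyr)
library(ggplot2)

# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  origin = c("New York", "New York", "Missouri",
             "California", "Maryland", "California"),
  age    = c(35, 48, 45, 31, 28, 50)
)

mean_ages <- preview_df %>%
  group_by(origin) %>%
  summarize(mean_age = mean(age))

# geom_col() takes bar heights from y, no stat argument needed
p <- ggplot(mean_ages, aes(x = origin, y = mean_age)) +
  geom_col(aes(fill = origin)) +
  ylab("mean age") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```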

Part 3: Linear Models

Modeling and prediction are important to statistics, and linear models are among the simplest models, if not the simplest. Our dataset has some numeric variables, and I already suspect that age and number of years active have a positive relationship.

But to be sure, let’s check if a linear relationship might explain our data.
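One way to run this check, assuming a scatterplot with a trend line is what we want to look at, is to add geom_smooth(method="lm") on top of the points. This sketch uses only the six preview rows (preview_df is my own name), so it will look sparser than a plot of the full rapper_df.

```r
library(dplyr)
library(ggplot2)

# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  age        = c(35, 48, 45, 31, 28, 50),
  start_year = c(2004, 1986, 1988, 2003, 2009, 1986)
) %>%
  mutate(years_active = 2018 - start_year)

# Points plus a straight least-squares trend line (no confidence band)
check_plot <- ggplot(preview_df, aes(x = age, y = years_active)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Age") +
  ylab("Years Active")

check_plot
```

If the points hug the line without an obvious curve, a linear model is a reasonable thing to try.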

From the looks of it, a linear model is appropriate for these data. We can make one by using the base R lm() function.

linear_model <- lm(years_active ~ age, data=rapper_df)

library(broom)
tidy(linear_model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -18.9     1.74       -10.9 3.44e-11
## 2 age             1.02    0.0456      22.5 1.47e-18

It’s standard to plot our linear model onto our data! We use geom_abline() to do this. Notice that I’m plotting the geom_abline() before geom_point() because I want the data points to be plotted on top of the line instead of below the line. Switch the order of these functions to see what happens.

slope <- 1.023801
intercept <- -18.914662
ggplot(rapper_df, aes(y=years_active, x=age)) +
  geom_abline(slope=slope, intercept=intercept, col="darkgrey") +
  geom_point(aes(col=origin)) +
  ggtitle("Rappers' Age and Years Active") +
  ylab("Years Active") +
  xlab("Age")
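Typing the slope and intercept by hand works, but the numbers can also be pulled straight out of the model with coef(), which avoids copy mistakes and rounding. Here is a sketch fitted on the six preview rows only (preview_df and fit are my own names), so its coefficients will not match the ones above.

```r
# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  age        = c(35, 48, 45, 31, 28, 50),
  start_year = c(2004, 1986, 1988, 2003, 2009, 1986)
)
preview_df$years_active <- 2018 - preview_df$start_year

fit <- lm(years_active ~ age, data = preview_df)

# coef() returns a named vector: "(Intercept)" and "age"
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["age"]]

intercept
slope
```

These two values can then be handed to geom_abline() exactly like the hand-typed ones.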

We can calculate the \(R^2\) value, the coefficient of determination.

rapper_df %>% summarize(r=cor(age, years_active),
                        r_2=r^2)
##           r       r_2
## 1 0.9752193 0.9510526
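For a model with a single predictor, squaring the correlation gives the same number as the usual definition of \(R^2\): one minus the leftover (residual) variation divided by the total variation around the mean. The sketch below checks that on the six preview rows only (preview_df and fit are my own names), so the value differs from the one above.

```r
# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  age        = c(35, 48, 45, 31, 28, 50),
  start_year = c(2004, 1986, 1988, 2003, 2009, 1986)
)
preview_df$years_active <- 2018 - preview_df$start_year

fit <- lm(years_active ~ age, data = preview_df)

# Variation left over after fitting the line
ss_res <- sum(residuals(fit)^2)
# Total variation of years_active around its mean
ss_tot <- sum((preview_df$years_active - mean(preview_df$years_active))^2)

r_squared   <- 1 - ss_res / ss_tot
r_squared_2 <- cor(preview_df$age, preview_df$years_active)^2

c(r_squared, r_squared_2, summary(fit)$r.squared)
```

All three numbers agree, which is a handy sanity check.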

Our \(R^2\) tells us what proportion of the variation in years active around its mean is explained by the linear relationship with age. (How much of our data’s position on the plot is explainable by the relationship we’re looking at?) When \(R^2\) is close to 1, as it is here, we can more confidently use the line to predict for ages between the minimum and maximum ages in our data. (But never forget: correlation does not imply causation!)

rapper_df %>% summarize(min_age=min(age),
                        max_age=max(age))
##   min_age max_age
## 1      19      52

From a little Google search, I found out that Lil Wayne is 35 years old, which is well within this range of ages. I also found out that Travis Scott is 26.

years_active_predicted <- slope*c(35, 26) + intercept
years_active_predicted
## [1] 16.918373  7.704164
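The same predictions can come from base R’s predict(), and asking for interval="prediction" also shows how wide the uncertainty around each guess is. This sketch is fitted on the six preview rows only (preview_df, fit and new_rappers are my own names), so its numbers will not match the ones above, and age 26 falls slightly below the youngest age in those six rows.

```r
# Only the six preview rows from above; the full rapper_df has 28 rows.
preview_df <- data.frame(
  age        = c(35, 48, 45, 31, 28, 50),
  start_year = c(2004, 1986, 1988, 2003, 2009, 1986)
)
preview_df$years_active <- 2018 - preview_df$start_year

fit <- lm(years_active ~ age, data = preview_df)

# The column name must match the predictor used in the model formula
new_rappers <- data.frame(age = c(35, 26))

# Matrix with columns fit (prediction), lwr and upr (95% prediction interval)
pred <- predict(fit, newdata = new_rappers, interval = "prediction")
pred
```

A wide gap between lwr and upr is a hint that a single predicted number should not be trusted too much.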

On Wikipedia, it says that Lil Wayne has been active since 1991, i.e. for 27 years. Similarly, it says that Travis Scott has been active since 2008, which is 10 active years. Our line predicted about 17 and 8 years, so it underestimates Lil Wayne by about 10 years and Travis Scott by about 2. There are plenty of reasons why this can happen, but one reason I can think of is very common.

We don’t have enough data!

Summary

We looked at a dataset that I made on rappers that I like. We used dplyr to see different subsets of our data and to calculate summary statistics. We then used ggplot2 to see if we could spot any patterns in our data by eye. Because we saw a positive linear association, we thought to make a linear model and used it to predict two famous rappers’ number of years active.