We’re going to go over three important tools in R by exploring a dataset of rappers. I built this dataset from rappers I enjoy, so as we go through the exploration together, keep in mind that I am a Bay Area hip hop fan: the dataset mixes internationally famous rappers with more “local” artists.
library(dplyr)
library(ggplot2)
In this exploration, we’re going to start with dataset manipulation. When you explore datasets yourself, you may prefer to begin with plots and then manipulate dataframes as necessary. There’s no one right way to explore, but the point is… you should explore!
Let’s load in our library.
library(dplyr)
We will be using a dataset with information about rappers. Here’s a preview of our dataset.
##      artist_name         legal_name age     origin net_worth start_year
## 1    Nicki Minaj        Onika Maraj  35   New York        75       2004
## 2          Jay Z       Shawn Carter  48   New York       900       1986
## 3         Eminem   Marshall Mathers  45   Missouri       190       1988
## 4 Kendrick Lamar Kendrick Duckworth  31 California        45       2003
## 5          Logic        Robert Hall  28   Maryland        10       2009
## 6           E-40       Earl Stevens  50 California        10       1986
The entire dataset has 28 rows and 6 columns
and includes information on the following rappers.
## [1] Nicki Minaj Jay Z Eminem
## [4] Kendrick Lamar Logic E-40
## [7] Nas Jadakiss Chance the Rapper
## [10] Childish Gambino Kanye West Cardi B
## [13] G-Eazy 2Pac YG
## [16] ScHoolboy Q Lupe Fiasco Drake
## [19] Joey BadA$$ Snoop Dogg Iamsu!
## [22] J. Cole Gucci Mane Rich Brian
## [25] Too $hort Mac Dre Missy Elliott
## [28] Notorious B.I.G.
## 28 Levels: Cardi B Chance the Rapper Childish Gambino E-40 ... YG
In the following cell blocks, we’re going to apply several different dplyr
functions on the dataset and see what we can find out.
Use arrange()
to sort based on a column (or columns) in your dataframe.
rapper_df %>% arrange(artist_name)
rapper_df %>% arrange(start_year)
rapper_df %>% arrange(age)
# * SORT BY AGE, THEN BY NET WORTH WITHIN EACH AGE
rapper_df %>% arrange(age, net_worth)
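By default arrange() sorts in ascending order. As a small sketch of two variations, the block below rebuilds the six preview rows shown earlier under the stand-in name rapper_preview (an assumption used only for this example, not the full rapper_df), then sorts in descending order with desc() and by two columns at once.

```r
library(dplyr)

# Hypothetical stand-in: the six rows from the preview, not the full 28-row rapper_df.
rapper_preview <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem", "Kendrick Lamar", "Logic", "E-40"),
  age         = c(35, 48, 45, 31, 28, 50),
  origin      = c("New York", "New York", "Missouri", "California", "Maryland", "California"),
  net_worth   = c(75, 900, 190, 45, 10, 10),
  start_year  = c(2004, 1986, 1988, 2003, 2009, 1986)
)

# Richest first: desc() reverses the default ascending order.
by_worth <- rapper_preview %>% arrange(desc(net_worth))

# Ties in the first column are broken by the second column.
# Jay Z and E-40 both started in 1986, so age decides their order.
by_start_then_age <- rapper_preview %>% arrange(start_year, age)

by_worth
by_start_then_age
```

Swapping the order of the columns inside arrange() changes which column is sorted first and which one only breaks ties.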
Use summarize()
to apply a summary function (such as mean()) to a column in your dataframe. This function returns a new dataframe (that is, not rapper_df
) containing the results of your calculations.
rapper_df %>% summarize(mean_age=mean(age),
mean_net=mean(net_worth))
Use group_by()
as an intermediate step that tells the next function to calculate its statistics separately for each group of rows that share a common column value.
rapper_df %>% group_by(origin) %>%
summarize(mean_age=mean(age),
mean_net=mean(net_worth))
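Grouped summaries are not limited to means. As a rough illustration, the sketch below counts the rows in each group with n() and totals net worth per origin. It runs on rapper_preview, a hypothetical stand-in built from the six preview rows, so the numbers will differ from what the full rapper_df gives.

```r
library(dplyr)

# Hypothetical stand-in: only the columns needed here, from the six preview rows.
rapper_preview <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem", "Kendrick Lamar", "Logic", "E-40"),
  origin      = c("New York", "New York", "Missouri", "California", "Maryland", "California"),
  net_worth   = c(75, 900, 190, 45, 10, 10)
)

# One output row per origin: how many rappers, and their combined net worth.
origin_summary <- rapper_preview %>%
  group_by(origin) %>%
  summarize(n_rappers = n(),
            total_net = sum(net_worth))

origin_summary
```

Checking n() alongside a mean is a useful habit: a group mean based on a single rapper says much less than one based on ten.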
Use select
to ask for the columns you want. This returns the same number of rows as the original dataframe and (as long as you don’t ask for all the columns) fewer columns.
rapper_df %>% select(artist_name, origin, net_worth)
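Besides naming the columns to keep, select() can also drop a column with a minus sign or take a run of adjacent columns with a colon. The sketch below shows both on rapper_preview, a hypothetical stand-in built from the preview rows rather than the full rapper_df.

```r
library(dplyr)

# Hypothetical stand-in: a few columns from the six preview rows.
rapper_preview <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem", "Kendrick Lamar", "Logic", "E-40"),
  age         = c(35, 48, 45, 31, 28, 50),
  origin      = c("New York", "New York", "Missouri", "California", "Maryland", "California"),
  net_worth   = c(75, 900, 190, 45, 10, 10),
  start_year  = c(2004, 1986, 1988, 2003, 2009, 1986)
)

# Keep everything except start_year.
no_year <- rapper_preview %>% select(-start_year)

# Keep the adjacent columns from artist_name through origin.
name_to_origin <- rapper_preview %>% select(artist_name:origin)

names(no_year)
names(name_to_origin)
```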
Use mutate
to add new columns to your dataset based on existing columns. Notice that we’re updating our original dataframe by assigning the result of all the code to the right of the <-
to rapper_df
. This will make rapper_df
have two more columns than it used to.
rapper_df <- rapper_df %>% mutate(years_active=2018-start_year,
worth_per_year=net_worth/years_active)
rapper_df
Use rename
when you want to change one of your column names. Sometimes, you may want to do this just because you prefer a name over the existing one.
rapper_df %>% rename(just_a_number=age,
cash=net_worth)
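Unlike the mutate example, the rename call above is not assigned to anything, so the new names only show up in the printed result. The sketch below makes that visible on rapper_preview, a hypothetical stand-in built from the preview rows.

```r
library(dplyr)

# Hypothetical stand-in: two columns from the six preview rows.
rapper_preview <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem", "Kendrick Lamar", "Logic", "E-40"),
  net_worth   = c(75, 900, 190, 45, 10, 10)
)

# rename() returns a new dataframe; the syntax is new_name = old_name.
renamed <- rapper_preview %>% rename(cash = net_worth)

names(renamed)         # has "cash"
names(rapper_preview)  # still has "net_worth", because we never reassigned it
```

To make a rename stick, assign the result back with <- just as in the mutate example.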
Use filter()
to create useful subsets or to simply explore your data. This function checks, for each row, whether the column values match your specified criteria and returns only the rows that do. The resulting dataframe therefore has the same number of columns and potentially fewer rows.
Our dataframe is pretty small, but imagine we had 500 entries on different rappers: we wouldn’t be able to pick out the rows that meet certain criteria just by eye.
rapper_df %>% filter(age>35)
rapper_df %>% filter(origin=="New York")
rapper_df %>% filter(age<median(age),
net_worth>median(net_worth))
rapper_df %>% filter(age>35 | !(origin=="New York"))
rapper_df %>% filter(age>35 & !(origin=="New York"))
rapper_df %>% filter(age>35 & age<35)
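The filter calls above combine conditions in three ways: a comma or & keeps rows where every condition holds, | keeps rows where at least one holds, and ! negates a condition. The sketch below works through them on rapper_preview, a hypothetical stand-in built from the six preview rows, so the row counts apply to those six rows only.

```r
library(dplyr)

# Hypothetical stand-in: the columns needed here, from the six preview rows.
rapper_preview <- data.frame(
  artist_name = c("Nicki Minaj", "Jay Z", "Eminem", "Kendrick Lamar", "Logic", "E-40"),
  age         = c(35, 48, 45, 31, 28, 50),
  origin      = c("New York", "New York", "Missouri", "California", "Maryland", "California")
)

# A comma between conditions means the same as &: both must be TRUE.
and_comma <- rapper_preview %>% filter(age > 35, origin != "New York")
and_amp   <- rapper_preview %>% filter(age > 35 & origin != "New York")

# | keeps a row if either condition is TRUE.
either <- rapper_preview %>% filter(age > 35 | origin != "New York")

# No age is both above and below 35, so this returns zero rows (but all columns).
impossible <- rapper_preview %>% filter(age > 35 & age < 35)

nrow(and_comma)
nrow(either)
nrow(impossible)
```

An empty result like the last one is not an error, which is why it is worth checking nrow() after a filter you expect to match something.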
In our next block, we’re going to talk about making plots of our data with ggplot2
. Let’s load in the library.
library(ggplot2)
Scatterplots are how we visualize a potential relationship between two numeric variables. We use geom_point()
to throw the points onto our ggplot()
canvas.
ggplot(rapper_df, aes(x=age, y=net_worth))
# * BASIC
ggplot(rapper_df, aes(x=age, y=net_worth)) + geom_point()
# * SPECIAL (rapper_df[-2,] DROPS ROW 2, JAY Z)
ggplot(rapper_df[-2,], aes(x=age, y=net_worth)) +
geom_point(col="#00BFC4") +
xlab("Age") +
ylab("Net Worth") +
ggtitle("Rappers' Net Worth versus Age")
subset <- rapper_df %>% filter(origin %in% c("California", "New York"))
ggplot(subset, aes(x=age, y=net_worth)) +
geom_point(aes(col=origin)) +
facet_wrap(~origin) +
xlab("Age") +
ylab("Net Worth")
Barplots are used to visualize categorical variables. Let’s see what we can find when we plot these artists’ origins.
# * BASIC
ggplot(rapper_df, aes(x=origin)) + geom_bar()
# * SPECIAL
ggplot(rapper_df, aes(x=origin)) +
geom_bar(aes(fill=origin)) +
theme(axis.text.x=element_text(angle=45, hjust=1))
Instead of counting up how many rappers are from each origin, we can also supply our own values for ggplot2
to visualize. Here, I’m telling ggplot
that I want the mean ages of each of the origins to be plotted in a bar graph. The difference here is that we provide y
in aes()
and stat="identity"
in geom_bar()
.
subset_2 <- rapper_df %>% group_by(origin) %>%
summarize(mean_age=mean(age))
ggplot(subset_2, aes(x=origin, y=mean_age)) +
geom_bar(stat="identity", aes(fill=origin)) +
ylab("mean age") +
theme(axis.text.x=element_text(angle=45, hjust=1))
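ggplot2 also provides geom_col(), which draws bars at the heights given by y and so behaves like geom_bar() with stat="identity". The sketch below is one possible rewrite of the plot above, run on rapper_preview, a hypothetical stand-in built from the six preview rows rather than the full rapper_df.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in: two columns from the six preview rows.
rapper_preview <- data.frame(
  age    = c(35, 48, 45, 31, 28, 50),
  origin = c("New York", "New York", "Missouri", "California", "Maryland", "California")
)

# Precompute one value per origin, as in the text.
mean_age_by_origin <- rapper_preview %>%
  group_by(origin) %>%
  summarize(mean_age = mean(age))

# geom_col() uses the y values as bar heights, so no stat argument is needed.
mean_age_plot <- ggplot(mean_age_by_origin, aes(x = origin, y = mean_age)) +
  geom_col(aes(fill = origin)) +
  ylab("mean age") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

mean_age_plot
```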
Modeling and prediction are important to statistics, and linear models are among the simplest models there are. Our dataset has some numeric variables, and I already suspect that age and number of years active have a positive relationship.
But to be sure, let’s check if a linear relationship might explain our data.
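One quick way to make this check is a scatterplot of years active against age. The sketch below is an assumption about what such a check could look like, not necessarily the plot used in this exploration. It runs on rapper_preview, a hypothetical stand-in built from the six preview rows, and overlays a straight-line fit with geom_smooth() for reference.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in: age and start_year from the six preview rows,
# with years_active computed the same way as in the mutate step.
rapper_preview <- data.frame(
  age        = c(35, 48, 45, 31, 28, 50),
  start_year = c(2004, 1986, 1988, 2003, 2009, 1986)
) %>%
  mutate(years_active = 2018 - start_year)

# Points that fall roughly along a straight line suggest a linear model is reasonable.
linearity_check <- ggplot(rapper_preview, aes(x = age, y = years_active)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "darkgrey") +
  xlab("Age") +
  ylab("Years Active")

linearity_check
```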
From the looks of it, a linear model is appropriate for these data. We can make one by using the base R lm()
function.
linear_model <- lm(years_active ~ age, data=rapper_df)
library(broom)
tidy(linear_model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -18.9      1.74       -10.9 3.44e-11
## 2 age            1.02     0.0456      22.5 1.47e-18
It’s standard to plot our linear model onto our data! We use geom_abline()
to do this. Notice that I’m plotting the geom_abline()
before geom_point()
because I want the data points to be plotted on top of the line instead of below the line. Switch the order of these functions to see what happens.
slope <- 1.023801
intercept <- -18.914662
ggplot(rapper_df, aes(y=years_active, x=age)) +
geom_abline(slope=slope, intercept=intercept, col="darkgrey") +
geom_point(aes(col=origin)) +
ggtitle("Rappers' Age and Years Active") +
ylab("Years Active") +
xlab("Age")
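The slope and intercept above were typed in by hand from the model output. As an alternative sketch, coef() pulls them straight from the fitted model, which avoids copy errors and rounding. The block below fits the model on rapper_preview, a hypothetical stand-in built from the six preview rows, so its coefficients will not match the ones from the full rapper_df.

```r
# Hypothetical stand-in: age and years active (2018 - start_year) for the six preview rows.
rapper_preview <- data.frame(
  age          = c(35, 48, 45, 31, 28, 50),
  years_active = 2018 - c(2004, 1986, 1988, 2003, 2009, 1986)
)

fit <- lm(years_active ~ age, data = rapper_preview)

# coef() returns a named vector; [[ ]] extracts one value without its name.
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["age"]]

# The line defined by these two numbers reproduces the model's fitted values.
line_values <- intercept + slope * rapper_preview$age
all.equal(line_values, unname(fitted(fit)))
```

These two values can be passed to geom_abline() exactly as the hand-typed ones are.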
We can calculate the \(R^2\) value, the coefficient of determination.
rapper_df %>% summarize(r=cor(age, years_active),
r_2=r^2)
## r r_2
## 1 0.9752193 0.9510526
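For a linear model with a single predictor, squaring the correlation gives the same \(R^2\) that the fitted model reports. The sketch below checks this on rapper_preview, a hypothetical stand-in built from the six preview rows, so the value itself will differ from the one above.

```r
# Hypothetical stand-in: age and years active (2018 - start_year) for the six preview rows.
rapper_preview <- data.frame(
  age          = c(35, 48, 45, 31, 28, 50),
  years_active = 2018 - c(2004, 1986, 1988, 2003, 2009, 1986)
)

fit <- lm(years_active ~ age, data = rapper_preview)

# Route 1: square the correlation between the two variables.
r_squared_from_cor <- cor(rapper_preview$age, rapper_preview$years_active)^2

# Route 2: read R^2 from the model summary.
r_squared_from_model <- summary(fit)$r.squared

all.equal(r_squared_from_cor, r_squared_from_model)
```

With more than one predictor the two routes no longer coincide, and the model summary is the one to use.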
Our \(R^2\) tells us what proportion of the variation in years active (its spread around the mean) is accounted for by the linear relationship with age. (How much of our data’s position on the plot is explainable by the relationship we’re looking at?) When \(R^2\) is close to 1, which in this case it is, we can more confidently use the line to predict for ages between the minimum and maximum ages in our data. (But never forget: correlation does not imply causation!)
rapper_df %>% summarize(min_age=min(age),
max_age=max(age))
## min_age max_age
## 1 19 52
From a little Google search, I found out that Lil Wayne is 35 years old, which is well within this range of ages. I also found out that Travis Scott is 26.
years_active_predicted <- slope*c(35, 26) + intercept
years_active_predicted
## [1] 16.918373 7.704164
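The same kind of prediction can be made with base R's predict(), which takes the fitted model and a dataframe of new values. The sketch below fits the model on rapper_preview, a hypothetical stand-in built from the six preview rows, so its predictions will not match the numbers above. It only shows that predict() and the slope-and-intercept arithmetic agree.

```r
# Hypothetical stand-in: age and years active (2018 - start_year) for the six preview rows.
rapper_preview <- data.frame(
  age          = c(35, 48, 45, 31, 28, 50),
  years_active = 2018 - c(2004, 1986, 1988, 2003, 2009, 1986)
)

fit <- lm(years_active ~ age, data = rapper_preview)

# newdata must contain a column named after the predictor in the formula (age).
new_rappers <- data.frame(age = c(35, 26))
predicted   <- predict(fit, newdata = new_rappers)

# The same calculation done by hand from the coefficients.
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["age"]] * new_rappers$age

all.equal(unname(predicted), manual)
```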
On Wikipedia, it says that Lil Wayne has been active since 1991, i.e. for 27 years. Similarly, it says that Travis Scott has been active since 2008, which is 10 active years. Based on our line, our estimates (about 17 and 8 years) are not so great. There are plenty of reasons why this can happen, but one reason I can think of is very common.
We don’t have enough data!
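To put numbers on how far off the two predictions were, the short sketch below subtracts the predicted years active from the years implied by the Wikipedia start dates. All figures are the ones quoted above; the variable names are made up for this example.

```r
# Predictions from the fitted line, as printed above.
predicted_years <- c(lil_wayne = 16.918373, travis_scott = 7.704164)

# Years active as of 2018, from the start years quoted above.
actual_years <- c(lil_wayne = 2018 - 1991, travis_scott = 2018 - 2008)

# Positive values mean the line underestimated the rapper's years active.
prediction_error <- actual_years - predicted_years
prediction_error
```

Both errors are positive, so the line underestimates in both cases, and by much more for Lil Wayne than for Travis Scott.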
We looked at a dataset that I made on rappers that I like. We used dplyr
to see different subsets of our data and to calculate summary statistics. We then used ggplot2
to see if we could spot any patterns in our data by eye. Because we saw a positive linear association, we made a linear model and used it to predict two famous rappers’ number of years active.