We've got a lot to learn. Let's get started.
(1) Review visualization: boxplots, stacked histograms
(2) Review numerical summaries: means, standard deviations
(3) ANOVA test
(4) Tukey's HSD
(5) Regression test
You know how to do this by now. We're going to read in a Zillow dataset on median price per square foot for homes, then take a subset using filter()
. We are only going to look at the top 10 ranked sized regions in the dataset. Then, we will condense the table into sample statistics.
# * CHUNK NECESSARY
library(readr)
zillow <- read_csv("data/zillow-city-mediansqft.csv")
head(zillow)
# * CHUNK UNNECESSARY
library(reshape2)
zillow <- melt(zillow, id.vars=c("RegionID", "RegionName", "State", "Metro", "CountyName", "SizeRank"))
Here is our dataset. The variable
column is in YYYY-MM format and value
is median dollars per square foot for that region.
# * CHUNK NECESSARY
zillow_subset <- zillow %>% filter(SizeRank<=10)
head(zillow_subset)
And here are the corresponding summary statistics. These are not perfect averages (we are using a subset), just averages based on our dataset.
# * CHUNK NECESSARY
zillow_state <- zillow_subset %>%
group_by(State) %>%
summarize(avg_state_median=mean(value, na.rm=TRUE), var_state_median=var(value, na.rm=TRUE))
zillow_state
state_subset <- zillow_subset %>% select(State, value) %>% filter(!is.na(value))
library(ggplot2)
ggplot(state_subset, aes(y=value, x=State, fill=State)) + geom_boxplot() + theme_minimal()
model <- lm(value ~ State, data=state_subset)
model
First, an ANOVA. These are the hypotheses we are interested in.
$H_0:$ All groups have the same mean.
$H_1:$ At least one of the groups has a different mean.
Are our assumptions met? Check lecture notes. Regardless, we will continue on to demonstrate code.
library(broom)
anova <- aov(model)
tidy(aov)
We are going to do a Tukey's HSD test to compare these three state's average medians!
$H_0:$ All of the pairwise differences (between the states) are the same.
$H_1:$ At least one of the differences is different.
Are our assumptions met? Check lecture notes. Regardless, we will continue on to demonstrate code.
in_all_honesty <- TukeyHSD(anova)
in_all_honesty %>% tidy()
queens <- zillow %>% filter(CountyName=="Queens") %>% filter(!is.na(value)) %>% select(variable, value)
queens <- queens[seq(from=1, to=nrow(queens), by=6),]
# * CHUNK UNNECESSARY
queens <- queens %>% mutate(variable=as.numeric(gsub("-0[0-9]","",variable)))
queens[seq(1, nrow(queens), by=2),1] <- queens[seq(1, nrow(queens), by=2),1] + 0.5
head(queens)
We'll draw the line of best fit on top of the scatterplot. Does it look good?
ggplot(queens, aes(x=variable, y=value)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Median Sqft Cost in Queens, NY from 1996 to 2017")
To me, the above looks like trash. The below (which we will observe but not write any calculations for in this class) looks better.
ggplot(queens, aes(x=variable, y=value)) +
geom_point() +
geom_smooth(method="loess", se=FALSE) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Median Sqft Cost in Queens, NY from 1996 to 2017")
We want to test the hypotheses based on the linear regression of Y (median price per sqft) and X (the year).
$H_0: \beta_1=0$
$H_1: \beta_1 \neq 0$
WAAARNNIIINGGG! You must be responsible with your data! The trend does not even look linear in the first place! If you're doing any predictions like this in practice, use time series analysis instead. This will require you a combination of modelling techniques beyond the scope of this class. But of course, again, for pedagogical reasons, we will continue with the example.
Are our assumptions met? Check lecture notes. Regardless, we will continue on to demonstrate code.
queens_model <- lm(value ~ variable, data=queens)
queens_model %>% tidy()
Check it out.
$$ \bar{x} \pm t*\text{SE} \implies 6.3617 \pm t*(1.5288) $$qt(0.975, df=nrow(queens)-2)
queens_model %>% glance()
queens_model %>% summary()