For the second step of my work on this project, I decided to attempt some statistical modeling of the time series data. My target parameter to estimate was the probability $\hat{\pi}$ that a petition would pass. The probability is conditioned a set of variables $X_{1},...,X_{p}$ that I attempted to discover. Given that we are attempting to estimate a binary outcome variable, with continuous predictors, indicated to me that using logistic regression would be a good choice.
We can derive the outcome variable by finding which petitions exceeded 100,000 signatures. We will begin by finding which petitions were successful.
#petitions_timeseries.to_csv("petitionsTS.csv")
#Read in the data from GitHub instead
petitions_timeseries=pd.read_csv("https://raw.githubusercontent.com/palautatan/project141b/master/data/petitionsTS.csv",index_col="Unnamed: 0")
successful_petitions=np.unique(petitions_timeseries[(petitions_timeseries.total>=100000)&(petitions_timeseries.days_left>=0)].id)
petitions[np.in1d(petitions.index,successful_petitions)==True].title
Unfortunately, we are only working with 7 successful petitions, this may cause some issues because we are trying to predict a rare event, given our data. Nonetheless, we can "extend out" the number of successes by mapping the value to all observations of a successful petition. We can find this result by creating the outcome variable. The result can be done with some simple numpy tricks.
passed=np.in1d(petitions_timeseries.id.values,successful_petitions).astype(int)
petitions_timeseries["passed"]=passed
Before we should derive the count of data entries where the petition eventually passed 100,000 signatures, we should subset out any data that might not be useful to us. We are trying to look at petitions before they reach 100,000 signatures, so we'll only consider when the total signatures is less than that value. We also don't want any petitions that have expired, so we'll only consider cases where the number of days is at least 0.
petitions_model=petitions_timeseries[(petitions_timeseries.total<100000)&(petitions_timeseries.days_left>=0)]
Now we can derive the counts of successes and failures:
petitions_model.groupby("passed").id.count()
Although the ratio may be potentially smaller, having a decent amount of variables having the "passed" value will help with your model selection process.
Because the data may have apparent seasonality, I've also decided to add "moving average" and differencing variables to eliminate seasonality among the petitions. We'll use these variables, and compare how they fit in the model fitting section.
def diff_signatures(pet_id):
"""diff_signatures: apply time series differencing, keep all values"""
sigs_diff=pd.DataFrame(np.concatenate((np.zeros(24),petitions_model[petitions_model.id==pet_id].value))).diff(24).loc[24:]
return(sigs_diff)
def ma_signatures(pet_id):
"""ma_signagtures: apply moving average, keep all values if length is less than 24"""
subset=petitions_model[petitions_model.id==pet_id]
if(len(subset)<24):
return(pd.DataFrame(subset.value))
sigs_diff=pd.DataFrame(petitions_model[petitions_model.id==pet_id].value).rolling(24).mean().rename(columns={0:"value"})
return(sigs_diff)
sigs_ma=pd.concat([ma_signatures(pet_id) for pet_id in petitions.index]).set_index(petitions_model.index)
sigs_diff=pd.concat([diff_signatures(pet_id) for pet_id in petitions.index]).set_index(petitions_model.index)
petitions_model.is_copy=False
petitions_model["day_lag"]=sigs_diff.values
petitions_model["moving_avg"]=sigs_ma.values
Now that we have some transformed variables as potential model options, we can move onto data exploration.
Before we attempt to fit the model, let's examine how some of the potential variables are related. For the exploration portion, I have made three plots below examining the variable relationships.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)
from pandas.tools.plotting import scatter_matrix
plt.hexbin(petitions_model.days_left, petitions_model.total, C=petitions_model.passed, gridsize=30, bins=None)
plt.xlim(30,0)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()
This shows the inverse of the previous graph.
plt.hexbin(petitions_model.days_left, petitions_model.value, C=petitions_model.passed, gridsize=30, bins=None)
plt.xlim(30,0)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()
This graph makes it clear that a moving average model is the most effective at modeling the realtionship between signature rate and success, because the data is considerably more independent of time than with differencing, also lacks seasonality, and reaches a high probability of success quickly.
plt.hexbin(petitions_model.total, petitions_model.moving_avg, C=petitions_model.passed, gridsize=30, bins=None)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()
To confirm that a moving average model is the best we also performed differencing, which produces a decent model that appears to deal with the seasality issue (but not the stationary one, unfortunately).
plt.hexbin(petitions_model.total, petitions_model.day_lag, C=petitions_model.passed, gridsize=30, bins=None)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()
For building a logistic model, I decided to follow some of the code from this page: http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976. The page taught me how to implement logistic regresion in Python, so my techniques are similar to his. I had the option of using StatsModels over SciPy, but they appear to not use regularization parameters in StatsModels. SciPy implemented a regularization parameter for me, which helps reduce the variance of the model.
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import math
For the model fitting, I decided to compare 3 models.
(1) $Y=B_{0}+B_{1}*daysleft+B_{2}*total+B_{3}*movingavg$
(2) $Y=B_{0}+B_{1}*daysleft+B_{2}*total+B_{3}*daylag$
(3) $Y=B_{0}+B_{1}*daysleft+B_{2}*total+B_{3}*value$
(4) $Y=B_{0}+B_{1}*daysleft+B_{2}*total$
petitions_non_na=petitions_model[np.isnan(petitions_model.moving_avg)==False]
def buildModel(features):
y, X = dmatrices('passed ~ days_left + total'+features, petitions_non_na, return_type = 'dataframe')
y=np.ravel(y)
y_avg=np.mean(y)
model=LogisticRegression()
model.fit(X,y)
probs=model.predict_proba(X)[:,1]
return(probs)
models_list=[" + moving_avg"," + day_lag"," + value",""]
models=[buildModel(feat) for feat in models_list]
models_df=pd.DataFrame(np.array(models).T,index=petitions_non_na.index,columns=["prob_mov_avg","prob_daylag","prob_value","prob_total"])
petitions_prob=pd.concat([petitions_non_na,models_df],axis=1)
petitions_prob["pct_total"]=petitions_prob.total/100000
Unfortunately, there are not strong diagnostic or selection tools for logistic regression, so I decided to eyeball test the models on one of the petitions. One of the key issues was trying to remove the heavy fluctuations in the probability based on seasonality. By examining our plot below, it appears the moving average model does the best job at removing fluctuations. We can examine this by examining the petitions total signatures and also the probability models. While the null model (prob_total) may fit well as well, it will not give us any good insight as to when a petition is "hot", so we may not know when a petition with low signature count is likely to succeed.
petitions_prob[petitions_prob.id==296].plot(x="days_left",y=["pct_total","prob_mov_avg","prob_daylag","prob_value","prob_total"])
plt.title("Comparison of the four logistic models")
plt.plot()