This project, although hosted on my site, was a collaboration between myself and two outstanding colleagues.
See Patrick Vacek's profile and Graham Smith's profile for data science brilliance.

View Project Page | Previous Step: Data Acquisition | Next Step: Exploring Language of White House Petitions
Petition Success Model

Part II: Building a prediction model for petition success

For the second step of my work on this project, I decided to attempt some statistical modeling of the time series data. My target was to estimate the probability $\hat{\pi}$ that a petition would pass, conditioned on a set of variables $X_{1},...,X_{p}$ that I attempted to discover. Since we are estimating a binary outcome variable with continuous predictors, logistic regression seemed like a good choice.
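Concretely, the standard logistic regression model relates the success probability $\pi$ to the predictors through the log-odds:

$$\log\frac{\pi}{1-\pi}=\beta_{0}+\beta_{1}X_{1}+\cdots+\beta_{p}X_{p}$$

so the fitted $\hat{\pi}$ is recovered by inverting the logit once the $\beta$ coefficients are estimated.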

Step 1: Find out how many petitions passed, select the useful data, derive the outcome variable, and create time-series-based variable transformations.

We can derive the outcome variable by finding which petitions exceeded 100,000 signatures before expiring, so we will begin by identifying the successful petitions.

In [4]:
#petitions_timeseries.to_csv("petitionsTS.csv")
#Read in the data from GitHub instead
petitions_timeseries=pd.read_csv("https://raw.githubusercontent.com/palautatan/project141b/master/data/petitionsTS.csv",index_col="Unnamed: 0")
In [12]:
successful_petitions=np.unique(petitions_timeseries[(petitions_timeseries.total>=100000)&(petitions_timeseries.days_left>=0)].id)
In [13]:
petitions[np.in1d(petitions.index,successful_petitions)].title
Out[13]:
277    A proposal to continue funding the National En...
292    Immediately abolish the Net Worth Sweep of Fan...
296                         Do not defund the NEA or NEH
297    Demand Trump administration add LGBT rights, c...
298    Investigate Hillary Clinton for crimes committ...
299                     Repeal the 1986 Hughes amendment
300                                       Repeal the NFA
Name: title, dtype: object

Unfortunately, we are only working with 7 successful petitions. This may cause some issues, because we are trying to predict a rare event given our data. Nonetheless, we can "extend out" the number of successes by mapping the success label to every hourly observation of a successful petition. We can create this outcome variable with some simple numpy tricks.

In [14]:
passed=np.in1d(petitions_timeseries.id.values,successful_petitions).astype(int)
petitions_timeseries["passed"]=passed

Before we derive the count of data entries where the petition eventually passed 100,000 signatures, we should subset out any data that might not be useful to us. We are trying to look at petitions before they reach 100,000 signatures, so we'll only consider rows where the total signature count is below that value. We also don't want any petitions that have expired, so we'll only consider cases where the number of days left is at least 0.

In [15]:
petitions_model=petitions_timeseries[(petitions_timeseries.total<100000)&(petitions_timeseries.days_left>=0)]

Now we can derive the counts of successes and failures:

In [16]:
petitions_model.groupby("passed").id.count()
Out[16]:
passed
0    190376
1      1524
Name: id, dtype: int64

Although the successes still make up under 1% of observations (1,524 of 191,900 rows), having a decent number of rows with the "passed" label will help with the model selection process.
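As an aside, scikit-learn can also compensate for this kind of class imbalance directly; a minimal sketch of the option, which we did not use in the models below:

from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights observations inversely to their class
# frequency, a standard mitigation when positive cases are rare
imbalance_aware_model=LogisticRegression(class_weight="balanced")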

Because the data appear to have daily seasonality, I've also decided to add "moving average" and differencing variables to remove it. We'll use these variables and compare how they fit in the model fitting section.
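Concretely, with hourly signature counts $y_{t}$, the two transforms used below are the 24-hour (daily) seasonal difference and the 24-hour moving average:

$$\Delta_{24}y_{t}=y_{t}-y_{t-24},\qquad MA_{24}(y_{t})=\frac{1}{24}\sum_{k=0}^{23}y_{t-k}$$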

In [25]:
def diff_signatures(pet_id):
    """diff_signatures: apply 24-hour time series differencing, keeping all values"""
    # Pad with 24 leading zeros so every observation has a value to difference
    # against, then drop the padding rows after differencing
    sigs_diff=pd.DataFrame(np.concatenate((np.zeros(24),petitions_model[petitions_model.id==pet_id].value))).diff(24).loc[24:]
    return(sigs_diff)

def ma_signatures(pet_id):
    """ma_signatures: apply a 24-hour moving average; keep raw values if length is less than 24"""
    subset=petitions_model[petitions_model.id==pet_id]
    # Petitions with fewer than 24 hourly observations cannot fill the window,
    # so return their raw values unchanged
    if(len(subset)<24):
        return(pd.DataFrame(subset.value))
    sigs_ma=pd.DataFrame(subset.value).rolling(24).mean()
    return(sigs_ma)
In [26]:
sigs_ma=pd.concat([ma_signatures(pet_id) for pet_id in petitions.index]).set_index(petitions_model.index)

sigs_diff=pd.concat([diff_signatures(pet_id) for pet_id in petitions.index]).set_index(petitions_model.index)
In [27]:
# Silence pandas' SettingWithCopyWarning before assigning the new columns
petitions_model.is_copy=False
petitions_model["day_lag"]=sigs_diff.values
petitions_model["moving_avg"]=sigs_ma.values

Now that we have some transformed variables as potential model options, we can move onto data exploration.

Step 2: Perform data exploration before we attempt to fit the model.

Before we attempt to fit the model, let's examine how some of the potential variables are related. For the exploration portion, I have made three plots below examining the variable relationships.

In [28]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)
# scatter_matrix moved from pandas.tools.plotting to pandas.plotting in newer pandas
from pandas.plotting import scatter_matrix

Examine the estimated probability across key variables using hexbinning

Relationship between days left and total signatures

We can see that the rate at which petitions get signatures is a very important factor in whether or not they will succeed, and that the bar is fairly high.

In [29]:
plt.hexbin(petitions_model.days_left, petitions_model.total, C=petitions_model.passed, gridsize=30, bins=None)
plt.xlim(30,0)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()   

Relationship between days left and signatures per hour

This shows the complementary view of the previous graph, using the hourly signature rate rather than the cumulative total.

In [30]:
plt.hexbin(petitions_model.days_left, petitions_model.value, C=petitions_model.passed, gridsize=30, bins=None)
plt.xlim(30,0)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()   

Relationship between total signatures and moving average

This graph makes it clear that the moving average is the most effective transform for modeling the relationship between signature rate and success: the data are considerably more independent of time than with differencing, lack seasonality, and reach a high probability of success quickly.

In [33]:
plt.hexbin(petitions_model.total, petitions_model.moving_avg, C=petitions_model.passed, gridsize=30, bins=None)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()

Relationship between total signatures and difference

To confirm that the moving average model is the best, we also performed differencing, which produces a decent model that appears to deal with the seasonality issue (but not the stationarity one, unfortunately).

In [35]:
plt.hexbin(petitions_model.total, petitions_model.day_lag, C=petitions_model.passed, gridsize=30, bins=None)
cb = plt.colorbar()
cb.set_label("Estimated probability")
plt.show()

Step 3: Fitting a logistic regression model

For building a logistic model, I decided to follow some of the code from this page: http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976. The page taught me how to implement logistic regression in Python, so my techniques are similar to his. I had the option of using StatsModels instead of scikit-learn, but StatsModels does not appear to apply a regularization parameter by default. scikit-learn's implementation regularizes the fit for me, which helps reduce the variance of the model.
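For reference, scikit-learn exposes this regularization through the inverse-strength parameter C, with an L2 penalty by default; the models below implicitly use the default C=1.0. A minimal sketch:

from sklearn.linear_model import LogisticRegression

# L2-regularized logistic regression; smaller C means stronger regularization
example_model=LogisticRegression(penalty="l2", C=1.0)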

In [37]:
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import math

For the model fitting, I decided to compare four models:

(1) $\mathrm{logit}(\pi)=\beta_{0}+\beta_{1}\cdot daysleft+\beta_{2}\cdot total+\beta_{3}\cdot movingavg$

(2) $\mathrm{logit}(\pi)=\beta_{0}+\beta_{1}\cdot daysleft+\beta_{2}\cdot total+\beta_{3}\cdot daylag$

(3) $\mathrm{logit}(\pi)=\beta_{0}+\beta_{1}\cdot daysleft+\beta_{2}\cdot total+\beta_{3}\cdot value$

(4) $\mathrm{logit}(\pi)=\beta_{0}+\beta_{1}\cdot daysleft+\beta_{2}\cdot total$

In [73]:
petitions_non_na=petitions_model[~np.isnan(petitions_model.moving_avg)]

def buildModel(features):
    """buildModel: fit a logistic regression on days_left, total, and any extra features, returning fitted probabilities"""
    y, X = dmatrices('passed ~ days_left + total'+features, petitions_non_na, return_type = 'dataframe')
    y=np.ravel(y)
    model=LogisticRegression()
    model.fit(X,y)
    # Fitted probability of eventually passing, for every observation
    probs=model.predict_proba(X)[:,1]
    return(probs)
In [80]:
models_list=[" + moving_avg"," + day_lag"," + value",""]
models=[buildModel(feat) for feat in models_list]
In [104]:
models_df=pd.DataFrame(np.array(models).T,index=petitions_non_na.index,columns=["prob_mov_avg","prob_daylag","prob_value","prob_total"])
petitions_prob=pd.concat([petitions_non_na,models_df],axis=1)
In [105]:
petitions_prob["pct_total"]=petitions_prob.total/100000

Assessing the probabilistic fit of our model

Unfortunately, logistic regression offers fewer standard diagnostic and selection tools than linear regression, so I decided to eyeball-test the models on one of the petitions. One of the key issues was removing the heavy, seasonality-driven fluctuations in the predicted probability. Examining the plot below, the moving average model does the best job at removing those fluctuations. We can judge this by plotting the petition's total signatures alongside each model's predicted probability. While the null model (prob_total) may also fit well, it gives no insight into when a petition is "hot", so we may not know when a petition with a low signature count is likely to succeed.
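As a complementary numeric check (not part of the original analysis), we could also compare the four models with a summary statistic such as ROC AUC, using the metrics module imported earlier:

y_true=petitions_prob.passed.values
for col in ["prob_mov_avg","prob_daylag","prob_value","prob_total"]:
    # Higher AUC means the model's probabilities rank eventually-successful
    # observations above unsuccessful ones more often
    print(col, metrics.roc_auc_score(y_true, petitions_prob[col]))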

In [106]:
petitions_prob[petitions_prob.id==296].plot(x="days_left",y=["pct_total","prob_mov_avg","prob_daylag","prob_value","prob_total"])
plt.title("Comparison of the four logistic models")
plt.show()


View Project Page | Previous Step: Data Acquisition | Next Step: Exploring Language of White House Petitions