This project, although hosted on my site, was a collaboration between myself and two outstanding data scientists. See Patrick Vacek's profile and Graham Smith's profile for their data science brilliance.

View Project Page | Next Step: Preliminary Petition Success Model Fitting
Data Acquisition

Data Sources

Our first step is to compile the data we will be working with, which we retrieved from two sources: the We The People API for White House petition data, and Twitter profile pages. These seemed like decent proxies for the metrics we were after, as White House petitions can be submitted by anyone and Twitter is a reasonably acceptable barometer of public opinion (although it tends to skew liberal, which makes it far from ideal).

Contents

  1. We The People API
  2. Twitter

We The People API

Step 1: Import packages, set up a function to extract the live petitions

Using the We the People API, we were able to extract petition data for several hundred petitions. The information we retrieved includes: (1) the petition ID, (2) date created, (3) deadline date, (4) petition title, (5) issues addressed, (6) petition type, (7) signature count, (8) the petition itself, and (9) the URL to the petition.

In [1]:
# LIBRARIES
from fastcache import clru_cache
from datetime import datetime as dt
import requests
import numpy as np
import re
import pandas as pd
import math
np.set_printoptions(suppress=True)

I will be using clru_cache, a function cache, to cache our API queries. This means I need to keep the number of unique parameter combinations as small as possible, since each distinct set of arguments creates its own cache entry. The following are the functions we used to retrieve the data we desired.

In [2]:
key="ETjU0uiiXFfA9AqvBUFooOEx2OmBdeq0nquzM1k4"

# API EXTRACTION FUNCTION
@clru_cache(maxsize=128,typed=False)
def get_petitions(limit="1000"):
    """
    get_petitions: extract petitions from the WeThePeople API
    INPUT: limit(string): Number of petitions to return.
    OUTPUT: request_json(dict): A dictionary containing the JSON file"""
    base="https://api.whitehouse.gov/v1/petitions.json?limit="+limit+"&sortBy=date_reached_public"+"&sortOrder=desc"
    request_get=requests.get(base+"&api_key="+key)
    request_json=request_get.json()
    return(request_json)
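As a quick sanity check (not part of the original notebook), we can confirm the cache is doing its job: calling the function twice with the same argument should register one miss and one hit. This assumes fastcache's clru_cache exposes the same cache_info() interface as functools.lru_cache, which it is designed to mirror.

In [ ]:
# Sanity check (illustrative): repeated calls with identical arguments should be served from the cache
get_petitions("1")                  # first call: cache miss, hits the API
get_petitions("1")                  # second call: answered from the cache, no HTTP request
print(get_petitions.cache_info())   # expect something like CacheInfo(hits=1, misses=1, ...)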

It should be noted that we can convert a dictionary into a dataframe simply by calling pd.DataFrame. As we can see, the petitions are stored under the "results" key; calling pd.DataFrame on that object later will make the data much easier to work with.

In [3]:
get_petitions("1")["results"]
Out[3]:
[{u'body': u'The Affordable Care Act (ACA) has enabled millions of Americans to obtain life-saving health care that has been unavailable to many in the past. Under the ACA, people with pre-existing conditions can get the health care that they need to improve quality of life. We urge the Administration to keep the key components of the ACA intact, including not allowing exclusion for pre-existing conditions. Let us maintain our moral compass as a nation and facilitate health care access for all Americans, regardless of ability to pay.',
  u'created': 1488599633,
  u'deadline': 1491188033,
  u'id': u'2516526',
  u'isPublic': True,
  u'isSignable': True,
  u'issues': [{u'id': 25, u'name': u'Health Care'}],
  u'petition_type': [{u'id': 281,
    u'name': u'Change an existing Administration policy'}],
  u'reachedPublic': 1490120785,
  u'response': [],
  u'signatureCount': 155,
  u'signatureThreshold': 100000,
  u'signaturesNeeded': 99845,
  u'status': u'open',
  u'title': u'Keep, but modify, the Affordable Care Act',
  u'type': u'petition',
  u'url': u'https://petitions.whitehouse.gov/petition/keep-modify-affordable-care-act'}]

This is the raw record for a single petition. Passing the full "results" list to pd.DataFrame gives us the tabular form we will work with, as sketched below.
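As an illustration (assuming the API call above succeeds), the conversion is a one-liner:

In [ ]:
# Convert the list of petition dictionaries into a dataframe and peek at a few columns
single_petition_df = pd.DataFrame(get_petitions("1")["results"])
single_petition_df[["id", "title", "signatureCount", "status"]]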

Step 2: Cleaning the petitions data

Our next step is converting all of the live petitions into an easy-to-read dataframe. To do this, we need to create a function that handles the JSON and performs a couple of string manipulations to clean the data.

In [4]:
def json_to_df(json):
    #Create a dataframe from the dictionary, some values will be raw and need conversion
    pet_df=pd.DataFrame(json)
    #The issues and petition type fields need fixing
    pet_df.issues=pd.Series([", ".join([re.sub("amp;","",issue["name"]) for issue in item["issues"]]) for item in json])
    pet_df.petition_type=pd.Series(["".join([category["name"] for category in item["petition_type"]]) for item in json])
    #Drop the (empty) response column
    pet_df = pet_df.drop('response', 1)
    return(pet_df)

Finally, we can run our functions to generate the data frame. We will be using the petition id for the next step. I've included an optional cell that writes the data if necessary.

In [5]:
json_petitions=get_petitions()
petitions=json_to_df(json_petitions["results"])
petitions.head()
Out[5]:
body created deadline id isPublic isSignable issues petition_type reachedPublic signatureCount signatureThreshold signaturesNeeded status title type url
0 The Affordable Care Act (ACA) has enabled mill... 1488599633 1491188033 2516526 True True Health Care Change an existing Administration policy 1490120785 155 100000 99845 open Keep, but modify, the Affordable Care Act petition https://petitions.whitehouse.gov/petition/keep...
1 The "fake news" meme suppresses the ... 1487985976 1490574376 2511691 True True Civil Rights & Equality, Economy & Jobs Call on Congress to act on an issue 1490103886 158 100000 99842 open Google, et al., recent actions violate freedom... petition https://petitions.whitehouse.gov/petition/goog...
2 Stop SPRAYING the chemtrails. It's poisoning o... 1489957100 1492549100 2524366 True True Energy & Environment Call on Congress to act on an issue 1490088738 335 100000 99665 open STOP THE SPRAYING OF THE CHEMTRAILS petition https://petitions.whitehouse.gov/petition/stop...
3 Allow nation-wide concealed carry to any indiv... 1488552578 1491140978 2515941 True True Government & Regulatory Reform Propose a new Administration policy 1490052376 173 100000 99827 open Enact legislation to provide nation-wide recip... petition https://petitions.whitehouse.gov/petition/enac...
4 Sheriff Joe Arpaio is under assault from the L... 1489776659 1492368659 2523566 True True Criminal Justice Reform Take or explain a position on an issue or policy 1489951075 217 100000 99783 open PROTECT SHERIFF JOE ARPAIO FROM UNLAWFUL PROSE... petition https://petitions.whitehouse.gov/petition/prot...
In [ ]:
#petitions.to_csv("petitions.csv")

I ended up labelling the petitions as "Liberal" or "Conservative" myself, which may introduce some bias. Nonetheless, we acquired additional data and compared it against the model to measure its effectiveness, which may help reduce the effect of that bias.
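Purely as a hypothetical sketch of how those hand labels could be attached (the label file and column names below are not part of this notebook), one option is to keep the hand-coded leanings in a small csv keyed by petition id and merge them in:

In [ ]:
# Hypothetical sketch only: "petition_labels.csv" and the "leaning" column are illustrative names
# labels = pd.read_csv("petition_labels.csv", dtype={"id": str})
# petitions = petitions.merge(labels, on="id", how="left")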

Step 3: Developing the petition signatures time series data.

The next step in the process is acquiring and shaping the data I will use to build my model predicting petition success. We start by creating another function to query the signatures portion of the API. We now need to call the API by petition id, and I also introduce a new variable called offset, because many petitions have more signatures than can be extracted in a single query.

In [6]:
@clru_cache(maxsize=128,typed=False)
def get_signatures(pet_id,limit="1000",offset="0"):
    """
    get_signatures: Extract a set of signatures from a WeThePeople petition
    INPUT:
    pet_id(string): The petition id from the petitions section of the API
    limit(string): The number of signatures to return
    offset(string): The index at which to start on the list of petitions
    OUTPUT:
    request_json(dict): A dictionary containing the JSON file
    """
    base="https://api.whitehouse.gov/v1/petitions/"+pet_id+"/signatures.json?limit="+limit+"&offset="+offset
    request_get=requests.get(base+"&api_key="+key)
    request_json=request_get.json()
    return(request_json)
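As an illustrative call (not part of the original notebook), we can pull the first batch of up to 1,000 signatures for the ACA petition shown earlier; for petitions with more than 1,000 signatures, we would step the offset by 1,000 on each subsequent call.

In [ ]:
# Illustrative: fetch the first page of signatures for petition id 2516526
first_page = get_signatures("2516526")   # defaults: limit="1000", offset="0"
len(first_page["results"])               # number of signature records returned
# Subsequent pages would use get_signatures("2516526", offset="1000"), offset="2000", and so on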

As we saw in the petitions dataframe, the time variables "created" and "deadline" are not in a conventional format. They are actually in UNIX time. Before we get to extracting the signatures, we need to have functions for converting the "created" and "deadline" variables to and from UNIX time.

In [7]:
# UNIX TIME CONVERSION BORROWED FROM STACK
def convertUNIXtime(time):
    """Takes a unix time integer and converts it into a date string."""
    new_time=dt.fromtimestamp(time).strftime('%Y-%m-%d %H:%M:%S')
    return(new_time)

# DATE BACK INTO UNIX
def convertDate(time):
    """Takes a date string and converts it into a unix time integer."""
    new_time=int(dt.strptime(time,"%Y-%m-%d %H:%M:%S").timestamp())
    return(new_time)
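A quick round trip (illustrative; the exact date string depends on the machine's local timezone, and daylight-saving transitions can introduce edge cases) using the creation time of the ACA petition above:

In [ ]:
# Round-trip check: UNIX time -> local date string -> UNIX time
created_unix = 1488599633                     # "created" value from the ACA petition
created_str = convertUNIXtime(created_unix)   # date string in the machine's local timezone
convertDate(created_str) == created_unix      # True: the two conversions are inverses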

Now that we are equipped with the proper tools for handling the data, we can start writing functions to make repeated calls to the API for signature extraction. We start with a simple helper function, findSigs, which calls the API once at a given offset and returns an array of the corresponding UNIX times at which the petition was signed.

In [8]:
def findSigs(petition,offset):
    """findSigs: call the signatures API once at a given offset
    INPUT:
    petition(string): The petition id
    offset(string): The index of the list of petition signatures
    OUTPUT:
    sig_times(array): An array containing UNIX time values at which the petition was signed"""
    sig_times=np.array(pd.DataFrame(get_signatures(petition,offset=offset)["results"]).created.values,dtype='int64')
    return(sig_times)

Once we have a set of signature times, we would like to turn it into time series data. The makeTS function will do this for us, and it serves as a helper for the final, generalized function that follows. The strategy for converting the UNIX times into time series data is explained step-by-step in the code comments.

In [9]:
def makeTS(data,pet_idx):
    """
    makeTS: Convert the set of all signatures times into an hourly time series dataframe.
    INPUT:
    data(array): An integer array containing UNIX times of all petition signatures
    pet_idx(int): The index on the petitions data frame, used for determining the deadline.
    OUTPUT:
    ts_df(dataframe): A pandas dataframe containing the cumulative signatures, by-hour signatures, days left, and id variables.
    """
    #Find the number of time periods for the petition
    num_hours=math.ceil((np.max(data)-np.min(data))/3600)+2
    #Find the date of the first hour of the petition
    blank_bin=re.sub("[0-9]+:[0-9]+$","",convertUNIXtime(np.min(data)))+"00:00"
    #Generate the time intervals for each hour that the petition has been running
    unix_intervals=[convertDate(blank_bin)+i*3600 for i in range(1,num_hours)]
    #Create the days left variable by subtracting each unix time by hour from the deadline
    days_left=np.round((petitions.deadline[pet_idx]-unix_intervals)/(60*60*24),decimals=2)
    #Derive the cumulative signatures by hour, and difference them to obtain the hourly signature counts
    cumulative_sigs=[len(np.extract(data<=ui,data)) for ui in unix_intervals]
    sigs_diff=np.diff(np.append([0],cumulative_sigs))
    #Find the corresponding hours for the time periods and use it as an index
    hours=[convertUNIXtime(ui) for ui in unix_intervals]
    #Finally, make a data frame with all these variables
    ts_df=pd.DataFrame({"total":cumulative_sigs,"value":sigs_diff,"days_left":days_left,
                        "id":[pet_idx]*len(days_left)},index=hours)
    return(ts_df)

We can combine these two functions to create get_all_signatures, a function that builds a time series dataset of the signatures for every petition we have from the We The People API. To make sure I call the API the right number of times without hitting an error, I set the number of queries to $\lceil \frac{n}{1000} \rceil$, where $n$ is the number of signatures on the petition. For example, a petition with 2,517 signatures requires three calls, at offsets 0, 1000, and 2000.

In [10]:
def get_all_signatures(idx):
    """Build the hourly signature time series dataframe for the petition at index idx of the petitions dataframe."""
    #Get the desired number of API calls for the petition
    num=math.ceil(petitions.signatureCount[idx]/1000)
    #Make all of the calls to the API
    sigtimes_list=[findSigs(petitions.id[idx],str(1000*i)) for i in range(num)]
    #Finally, concatenate the list of signature times into an array, and then convert it to a time series dataframe.
    signatures_ts=np.concatenate(sigtimes_list)
    ts_df=makeTS(signatures_ts,idx)
    return(ts_df)
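As a quick illustration (assuming the petitions dataframe from earlier is still in memory), the time series for a single petition can be built like so:

In [ ]:
# Illustrative: build the hourly signature time series for the first petition in the dataframe
ts_example = get_all_signatures(0)
ts_example.head()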

We can finally perform the data extraction, simply by calling a list comprehension with the final function and then using pd.concat to combine the list of dataframes into a single dataframe.

Warning: Extracting all of the data is very slow and may take up to 30 minutes; it is meant to be done only once. I recommend reading in the csv from GitHub instead.

In [12]:
# petitions_timeseries=pd.concat([get_all_signatures(i) for i in petitions.index])

Webscraping Twitter

We also webscraped the 20 most recent tweets from each of 23 political sources on Twitter, a site where many people get their political news. We manually compiled the name and URL of each profile in the twitter_links.csv file.

We wanted to get a sampling of people from every position on the political spectrum, so our scraped profiles cover a variety of mainstream/fringe politicians and media figures. These included: Donald Trump, Betsy Devos, Kellyanne Conway, Mike Pence, Mitt Romney, Jeb Bush, Milo Yiannopoulos, Sarah Palin, Ted Cruz, Jerry Brown, Jill Stein, Barack Obama, Joe Biden, Bernie Sanders, Hillary Clinton, Robert Reich, Justin Trudeau, Nate Silver, NYT Politics, CNN Politics, FOX Politics, Post Politics, and We the People.

In [1]:
# LIBRARIES
import pandas as pd
from bs4 import BeautifulSoup
import requests
In [3]:
# IMPORT DATA
twitter = pd.read_csv("data/twitter_links.csv")
twitter.head(10)
Out[3]:
profile url
0 Donald Trump https://twitter.com/POTUS
1 Betsy Devos https://twitter.com/BetsyDeVosED
2 Kellyanne Conway https://twitter.com/KellyannePolls
3 Mike Pence https://twitter.com/mike_pence
4 Mitt Romney https://twitter.com/MittRomney
5 Jeb Bush https://twitter.com/JebBush
6 Milo Yiannopoulos https://twitter.com/DontGoAwayM4d
7 Sarah Palin https://twitter.com/SarahPalinUSA
8 Ted Cruz https://twitter.com/tedcruz
9 Jerry Brown https://twitter.com/JerryBrownGov
In [4]:
# VIEW ALL OF THE SOURCES WE ARE SCRAPING FROM
print twitter["profile"].values
['Donald Trump' 'Betsy Devos' 'Kellyanne Conway' 'Mike Pence' 'Mitt Romney'
 'Jeb Bush' 'Milo Yiannopoulos' 'Sarah Palin' 'Ted Cruz' 'Jerry Brown'
 'Jill Stein' 'Barack Obama' 'Joe Biden' 'Bernie Sanders' 'Hillary Clinton'
 'Robert Reich' 'Justin Trudeau' 'Nate Silver' 'NYT Politics'
 'CNN Politics' 'FOX Politics' 'Post Politics' 'We the People']

The following is the function we used to retrieve the data we desired.

In [5]:
# FUNCTION
def get_tweet_bag(twitter_url):
    # PARSE PROFILE
    this_request = requests.get(twitter_url).text
    abc_soup = BeautifulSoup(this_request, "html.parser")

    # GRAB DATA FOR 20 TWEETS
    twenty_tweet_data = abc_soup.find_all("div", {"class": "js-tweet-text-container"})
    
    # GET THE 20 TWEETS FOR ONE PERSON
    twenty_tweets = [x.find_all("p")[0].text for x in twenty_tweet_data]
    twenty_tweets = [x.encode("ascii", "replace") for x in twenty_tweets]

    # CREATE BAG OF WORDS
    tweet_bag = " ".join(twenty_tweets)
    
    return(tweet_bag)
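Before running the scraper over the whole list, it can be spot-checked on a single profile (illustrative only; the selectors depend on Twitter's 2017-era page markup, so this may not work against the current site):

In [ ]:
# Illustrative: scrape one profile and preview the start of its bag of words
potus_bag = get_tweet_bag("https://twitter.com/POTUS")
print potus_bag[:200]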
In [6]:
# LIST COMP TO RETRIEVE 20 TWEETS FROM EACH SOURCE
tweet_bags = [get_tweet_bag(x) for x in twitter["url"]]

This is the dataframe we constructed.

In [7]:
# ASSEMBLING DATAFRAME
twitter["tweet_bags"] = tweet_bags
# twitter.to_csv("data/twitter_data.csv")
twitter.head(10)
Out[7]:
profile url tweet_bags
0 Donald Trump https://twitter.com/POTUS FBI Director Comey: fmr. DNI Clapper "right" t...
1 Betsy Devos https://twitter.com/BetsyDeVosED "At the end of the day we should measure every...
2 Kellyanne Conway https://twitter.com/KellyannePolls Congratulations @erictrump & @LaraLeaTrump on ...
3 Mike Pence https://twitter.com/mike_pence .@POTUS showed true leadership in his #JointAd...
4 Mitt Romney https://twitter.com/MittRomney I'm a fan of proposed Deputy Treasury Secretar...
5 Jeb Bush https://twitter.com/JebBush Such an unnecessary distraction given all the ...
6 Milo Yiannopoulos https://twitter.com/DontGoAwayM4d http://bit.ly/2mRyeJq? via /r/KiA #gamergate K...
7 Sarah Palin https://twitter.com/SarahPalinUSA The best things happen while fishing. Love thi...
8 Ted Cruz https://twitter.com/tedcruz Add your name if you agree -- no US funding fo...
9 Jerry Brown https://twitter.com/JerryBrownGov California is Not Turning Back, Not Now, Not E...


View Project Page | Next Step: Preliminary Petition Success Model Fitting