Our first step is to compile the data we will be working with, which we retrieved from two sources: the We The People API for White House petition data, and Twitter profile pages. These seemed like reasonable proxies for the metrics we were after, since White House petitions can be submitted by anyone and Twitter is a reasonably acceptable barometer of public opinion (although it tends to skew liberal, which makes it far from ideal).
Using the We the People API, we were able to extract petition data for several hundred petitions. The information we retrieved includes: (1) the petition ID, (2) date created, (3) deadline date, (4) petition title, (5) issues addressed, (6) petition type, (7) signature count, (8) the petition itself, and (9) the URL to the petition.
# LIBRARIES
from fastcache import clru_cache
from datetime import datetime as dt
import requests
import numpy as np
import re
import pandas as pd
import math
np.set_printoptions(suppress=True)
I will be using clru_cache, a function cache, to cache our API queries. This means the function should take as few unique parameters as possible. The following are the functions we used to retrieve the data we desired.
key="ETjU0uiiXFfA9AqvBUFooOEx2OmBdeq0nquzM1k4"
# API EXTRACTION FUNCTION
@clru_cache(maxsize=128,typed=False)
def get_petitions(limit="1000"):
    """
    get_petitions: extract petitions from the WeThePeople API
    INPUT: limit(string): Number of petitions to return.
    OUTPUT: request_json(dict): A dictionary containing the JSON file
    """
    base="https://api.whitehouse.gov/v1/petitions.json?limit="+limit+"&sortBy=date_reached_public"+"&sortOrder=desc"
    request_get=requests.get(base+"&api_key="+key)
    request_json=request_get.json()
    return(request_json)
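Since fastcache's clru_cache is intended as a drop-in replacement for functools.lru_cache, a repeated call with identical arguments should be served from the cache rather than triggering a second request; assuming the installed version exposes the same counters as functools, this is easy to verify:
# The second call with identical arguments should hit the cache
get_petitions("1")
get_petitions("1")
get_petitions.cache_info()   # hits should be >= 1 (assuming functools-style counters)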
It should be noted that we can convert a dictionary into a dataframe simply by calling pd.DataFrame. As we can see, the petitions are stored under the "results" key; calling pd.DataFrame on this object later will make it easier to work with.
get_petitions("1")["results"]
This is the raw output we will convert into a dataframe.
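For instance, a minimal sketch of that conversion, using the single petition pulled above:
# Wrap the "results" list of petition dictionaries in a dataframe
pd.DataFrame(get_petitions("1")["results"])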
Our next step is to convert all of the live petitions into an easy-to-read dataframe. To do this, we will need to create a function that handles the JSON and applies a couple of string manipulations to clean the data.
def json_to_df(json):
    #Create a dataframe from the dictionary, some values will be raw and need conversion
    pet_df=pd.DataFrame(json)
    #The issues and petition type fields need fixing
    pet_df.issues=pd.Series([", ".join([re.sub("amp;","",issue["name"]) for issue in item["issues"]]) for item in json])
    pet_df.petition_type=pd.Series(["".join([category["name"] for category in item["petition_type"]]) for item in json])
    pet_df=pet_df.drop('response',axis=1)
    return(pet_df)
Finally, we can run our functions to generate the data frame. We will be using the petition id for the next step. I've included an optional cell that writes the data if necessary.
json_petitions=get_petitions()
petitions=json_to_df(json_petitions["results"])
petitions.head()
#petitions.to_csv("petitions.csv")
I ended up labelling the petitions as "Liberal" or "Conservative" myself, which may introduce some bias. Nonetheless, additional data was acquired and compared to measure the effectiveness of the model, which may help reduce the effect of the bias.
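Purely as a hypothetical sketch of how such hand-made labels could be attached (the file and column names below are illustrative, not the ones actually used in the project):
# Hypothetical: merge hand-labelled leanings stored in a separate csv
# labels = pd.read_csv("data/petition_labels.csv")   # columns: id, leaning
# petitions = petitions.merge(labels, on="id", how="left")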
The next step in the process is acquiring and shaping the data I will use to build a model predicting petition success. We start by creating another function to query the signatures portion of the API. This time we call the API by petition id, and I also introduce an offset parameter, because many petitions have more signatures than can be extracted in a single query.
@clru_cache(maxsize=128,typed=False)
def get_signatures(pet_id,limit="1000",offset="0"):
    """
    get_signatures: Extract a set of signatures from a WeThePeople petition
    INPUT:
        pet_id(string): The petition id from the petitions section of the API
        limit(string): The number of signatures to return
        offset(string): The index at which to start on the list of signatures
    OUTPUT:
        request_json(dict): A dictionary containing the JSON file
    """
    base="https://api.whitehouse.gov/v1/petitions/"+pet_id+"/signatures.json?limit="+limit+"&offset="+offset
    request_get=requests.get(base+"&api_key="+key)
    request_json=request_get.json()
    return(request_json)
As we saw in the petitions dataframe, the time variables "created" and "deadline" are not in a conventional format. They are actually in UNIX time. Before we get to extracting the signatures, we need to have functions for converting the "created" and "deadline" variables to and from UNIX time.
# UNIX TIME CONVERSION BORROWED FROM STACK
def convertUNIXtime(time):
    """Takes a unix time integer and converts it into a date string."""
    new_time=dt.fromtimestamp(time).strftime('%Y-%m-%d %H:%M:%S')
    return(new_time)

# DATE BACK INTO UNIX
def convertDate(time):
    """Takes a date string and converts it into a unix time integer."""
    new_time=int(dt.strptime(time,"%Y-%m-%d %H:%M:%S").timestamp())
    return(new_time)
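As a quick sanity check, the two converters should round-trip a timestamp; the date below is arbitrary:
# Round-trip an arbitrary date string through UNIX time and back
example_unix = convertDate("2017-01-20 12:00:00")
convertUNIXtime(example_unix)   # should give back '2017-01-20 12:00:00'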
Now that we are equipped with the proper tools for handling the data, we can start writing functions to make repeated calls to the API for signature extraction. We start with a simple helper function, findSigs, which calls the API once at a given offset and returns an array of the corresponding UNIX times at which the petition was signed.
def findSigs(petition,offset):
    """findSigs: call the signatures API once at a given offset
    INPUT:
        petition(string): The petition id
        offset(string): The index of the list of petition signatures
    OUTPUT:
        sig_times(array): An array containing UNIX time values at which the petition was signed"""
    sig_times=np.array(pd.DataFrame(get_signatures(petition,offset=offset)["results"]).created.values,dtype='int64')
    return(sig_times)
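A single call on the first petition in the dataframe looks something like this (index 0 is purely illustrative):
# Grab the first batch of signature times for the first petition
first_batch = findSigs(petitions.id[0], "0")
first_batch[:5]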
Once we have a set of signatures, we would like to make it into time series data. The makeTS function will do this for us, and it serves as a helper function for the final, generalized function afterwards. I've explained the strategy on how to convert the UNIX times to time series data within the code, step-by-step.
def makeTS(data,pet_idx):
    """
    makeTS: Convert the set of all signature times into an hourly time series dataframe.
    INPUT:
        data(array): An integer array containing UNIX times of all petition signatures
        pet_idx(int): The index on the petitions data frame, used for determining the deadline.
    OUTPUT:
        ts_df(dataframe): A pandas dataframe containing the cumulative signatures, by-hour signatures, days left, and id variables.
    """
    #Find the number of time periods for the petition
    num_hours=math.ceil((np.max(data)-np.min(data))/3600)+2
    #Find the date of the first hour of the petition
    blank_bin=re.sub("[0-9]+:[0-9]+$","",convertUNIXtime(np.min(data)))+"00:00"
    #Generate the time intervals for each hour that the petition has been running
    unix_intervals=[convertDate(blank_bin)+i*3600 for i in range(1,num_hours)]
    #Create the days left variable by subtracting each hourly unix time from the deadline
    days_left=np.round((petitions.deadline[pet_idx]-np.array(unix_intervals))/(60*60*24),decimals=2)
    #Derive the cumulative signatures by hour, and difference them to obtain the hourly signature counts
    cumulative_sigs=[len(np.extract(data<=ui,data)) for ui in unix_intervals]
    sigs_diff=np.diff(np.append([0],cumulative_sigs))
    #Find the corresponding hours for the time periods and use them as an index
    hours=[convertUNIXtime(ui) for ui in unix_intervals]
    #Finally, make a data frame with all these variables
    ts_df=pd.DataFrame({"total":cumulative_sigs,"value":sigs_diff,"days_left":days_left,
                        "id":[pet_idx]*len(days_left)},index=hours)
    return(ts_df)
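Putting the two helpers together on a single petition gives a feel for the output; the index 0 here is just an example, and only the first batch of up to 1,000 signatures is used:
# Build an hourly time series from the first batch of signatures of petition 0
example_ts = makeTS(findSigs(petitions.id[0], "0"), 0)
example_ts.head()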
We can combine these two functions to create get_all_signatures, a function that builds a time series dataset for the signatures of every petition we have from the We The People API. To make sure I call the API the proper number of times without encountering errors, I set the number of queries to $\lceil \frac{n}{1000} \rceil$, where $n$ is the number of signatures on the petition.
def get_all_signatures(idx):
    #Get the required number of API calls for the petition
    num=math.ceil(petitions.signatureCount[idx]/1000)
    #Make all of the calls to the API
    sigtimes_list=[findSigs(petitions.id[idx],str(1000*i)) for i in range(num)]
    #Finally, concatenate the list of signature times into an array, and then convert it to a time series dataframe.
    signatures_ts=np.concatenate(sigtimes_list)
    ts_df=makeTS(signatures_ts,idx)
    return(ts_df)
We can finally perform the data extraction, simply by calling a list comprehension with the final function, and then using pd.concat to combine the list of dataframes into a single dataframe.
Warning: The operation to extract all of the data is very slow, and may take up to 30 minutes. It is meant to be completed once. I recommend reading in the csv from github instead.
# petitions_timeseries=pd.concat([get_all_signatures(i) for i in petitions.index])
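If the extraction has already been run, it is much faster to read the saved file back in; the path below is an assumption, so point it at wherever the csv actually lives:
# Read the pre-computed time series instead of re-querying the API
# (the path is an assumption; adjust to the csv location in the repo)
# petitions_timeseries = pd.read_csv("data/petitions_timeseries.csv", index_col=0)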
We also scraped the 20 most recent tweets from each of a set of political sources on Twitter, a site where many people get their political news. We manually compiled the URL and profile name of each source in the twitter_links.csv file.
We wanted to get a sampling of people from every position on the political spectrum, so our scraped profiles cover a variety of mainstream/fringe politicians and media figures. These included: Donald Trump, Betsy Devos, Kellyanne Conway, Mike Pence, Mitt Romney, Jeb Bush, Milo Yiannopoulos, Sarah Palin, Ted Cruz, Jerry Brown, Jill Stein, Barack Obama, Joe Biden, Bernie Sanders, Hillary Clinton, Robert Reich, Justin Trudeau, Nate Silver, NYT Politics, CNN Politics, FOX Politics, Post Politics, and We the People.
# LIBRARIES
import pandas as pd
from bs4 import BeautifulSoup
import requests
# IMPORT DATA
twitter = pd.read_csv("data/twitter_links.csv")
twitter.head(10)
# VIEW ALL OF THE SOURCES WE ARE SCRAPING FROM
print(twitter["profile"].values)
The following is the function we used to retrieve the data we desired.
# FUNCTION
def get_tweet_bag(twitter_url):
    # PARSE PROFILE
    this_request = requests.get(twitter_url).text
    abc_soup = BeautifulSoup(this_request, "html.parser")
    # GRAB DATA FOR 20 TWEETS
    twenty_tweet_data = abc_soup.find_all("div", {"class": "js-tweet-text-container"})
    # GET THE 20 TWEETS FOR ONE PERSON
    twenty_tweets = [x.find_all("p")[0].text for x in twenty_tweet_data]
    twenty_tweets = [x.encode("ascii", "replace").decode("ascii") for x in twenty_tweets]
    # CREATE BAG OF WORDS
    tweet_bag = " ".join(twenty_tweets)
    return(tweet_bag)
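Before looping over every source, it is worth smoke-testing the scraper on a single profile; the first row of the twitter dataframe is used here just as an example:
# Scrape one profile as a quick test
sample_bag = get_tweet_bag(twitter["url"].iloc[0])
sample_bag[:200]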
# LIST COMP TO RETRIEVE 20 TWEETS FROM EACH SOURCE
tweet_bags = [get_tweet_bag(x) for x in twitter["url"]]
This is the dataframe we constructed.
# ASSEMBLING DATAFRAME
twitter["tweet_bags"] = tweet_bags
# twitter.to_csv("data/twitter_data.csv")
twitter.head(10)