Our first step is to compile the data we will be working with, which we retrieved from two sources: the We The People API for White House petition data, and Twitter profile pages. These seemed like reasonable proxies for the metrics we were after, since White House petitions can be submitted by anyone and Twitter is a reasonably acceptable barometer of public opinion (although it tends to skew liberal, which makes it far from ideal).
Using the We the People API, we were able to extract petition data for several hundred petitions. The information we retrieved includes: (1) the petition ID, (2) date created, (3) deadline date, (4) petition title, (5) issues addressed, (6) petition type, (7) signature count, (8) the petition itself, and (9) the URL to the petition.
# LIBRARIES
from fastcache import clru_cache
from datetime import datetime as dt
import requests
import numpy as np
import re
import pandas as pd
import math
np.set_printoptions(suppress=True)
I will be using clru_cache, a function cache, to cache our API queries. This means the function should take as few unique parameters as possible. The following are the functions we used to retrieve the data we desired.
key="ETjU0uiiXFfA9AqvBUFooOEx2OmBdeq0nquzM1k4"
# API EXTRACTION FUNCTION
@clru_cache(maxsize=128,typed=False)
def get_petitions(limit="1000"):
    """
    get_petitions: extract petitions from the WeThePeople API
    INPUT: limit(string): Number of petitions to return.
    OUTPUT: request_json(dict): A dictionary containing the JSON file
    """
    base="https://api.whitehouse.gov/v1/petitions.json?limit="+limit+"&sortBy=date_reached_public"+"&sortOrder=desc"
    request_get=requests.get(base+"&api_key="+key)
    request_json=request_get.json()
    return(request_json)
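Since fastcache's clru_cache is intended as a drop-in replacement for functools.lru_cache, a repeated call with identical arguments should be served from the cache rather than triggering a second request; assuming the installed version exposes the same counters as functools, this is easy to verify:
# The second call with identical arguments should hit the cache
get_petitions("1")
get_petitions("1")
get_petitions.cache_info()   # hits should be >= 1 (assuming functools-style counters)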
It should be noted that we can convert a dictionary into a dataframe simply by calling pd.DataFrame. As we can see, the petitions are stored under the "results" key; calling pd.DataFrame on this object later will make it easier to work with.
get_petitions("1")["results"]
This is the raw output we will convert into a dataframe.
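For instance, a minimal sketch of that conversion, using the single petition pulled above:
# Wrap the "results" list of petition dictionaries in a dataframe
pd.DataFrame(get_petitions("1")["results"])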
Our next step is to convert all of the live petitions into an easy-to-read dataframe. To do this, we will need to create a function that handles the JSON and applies a couple of string manipulations to clean the data.
def json_to_df(json):
    #Create a dataframe from the dictionary, some values will be raw and need conversion
    pet_df=pd.DataFrame(json)
    #The issues and petition type fields need fixing
    pet_df.issues=pd.Series([", ".join([re.sub("amp;","",issue["name"]) for issue in item["issues"]]) for item in json])
    pet_df.petition_type=pd.Series(["".join([category["name"] for category in item["petition_type"]]) for item in json])
    pet_df=pet_df.drop('response',axis=1)
    return(pet_df)
Finally, we can run our functions to generate the data frame. We will be using the petition id for the next step. I've included an optional cell that writes the data if necessary.
json_petitions=get_petitions()
petitions=json_to_df(json_petitions["results"])
petitions.head()
#petitions.to_csv("petitions.csv")
I ended up labelling the petitions as "Liberal" or "Conservative" myself, which may introduce some bias. Nonetheless, additional data was acquired and compared to measure the effectiveness of the model, which may help reduce the effect of the bias.
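Purely as a hypothetical sketch of how such hand-made labels could be attached (the file and column names below are illustrative, not the ones actually used in the project):
# Hypothetical: merge hand-labelled leanings stored in a separate csv
# labels = pd.read_csv("data/petition_labels.csv")   # columns: id, leaning
# petitions = petitions.merge(labels, on="id", how="left")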
The next step in the process is acquiring and shaping the data I will use to build a model predicting petition success. We start by creating another function to query the signatures portion of the API. This time we call the API by petition id, and I also introduce an offset parameter, because many petitions have more signatures than can be extracted in a single query.
@clru_cache(maxsize=128,typed=False)
def get_signatures(pet_id,limit="1000",offset="0"):
    """
    get_signatures: Extract a set of signatures from a WeThePeople petition
    INPUT:
        pet_id(string): The petition id from the petitions section of the API
        limit(string): The number of signatures to return
        offset(string): The index at which to start on the list of signatures
    OUTPUT:
        request_json(dict): A dictionary containing the JSON file
    """
    base="https://api.whitehouse.gov/v1/petitions/"+pet_id+"/signatures.json?limit="+limit+"&offset="+offset
    request_get=requests.get(base+"&api_key="+key)
    request_json=request_get.json()
    return(request_json)
As we saw in the petitions dataframe, the time variables "created" and "deadline" are not in a conventional format. They are actually in UNIX time. Before we get to extracting the signatures, we need to have functions for converting the "created" and "deadline" variables to and from UNIX time.
# UNIX TIME CONVERSION BORROWED FROM STACK
def convertUNIXtime(time):
    """Takes a unix time integer and converts it into a date string."""
    new_time=dt.fromtimestamp(time).strftime('%Y-%m-%d %H:%M:%S')
    return(new_time)

# DATE BACK INTO UNIX
def convertDate(time):
    """Takes a date string and converts it into a unix time integer."""
    new_time=int(dt.strptime(time,"%Y-%m-%d %H:%M:%S").timestamp())
    return(new_time)
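As a quick sanity check, the two converters should round-trip a timestamp; the date below is arbitrary:
# Round-trip an arbitrary date string through UNIX time and back
example_unix = convertDate("2017-01-20 12:00:00")
convertUNIXtime(example_unix)   # should give back '2017-01-20 12:00:00'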
Now that we are equipped with the proper tools for handling the data, we can start writing functions to make repeated calls to the API for signature extraction. We start with a simple helper function, findSigs, which calls the API once at a given offset and returns an array of the corresponding UNIX times at which the petition was signed.
def findSigs(petition,offset):
    """findSigs: call the signatures API once at a given offset
    INPUT:
        petition(string): The petition id
        offset(string): The index of the list of petition signatures
    OUTPUT:
        sig_times(array): An array containing UNIX time values at which the petition was signed"""
    sig_times=np.array(pd.DataFrame(get_signatures(petition,offset=offset)["results"]).created.values,dtype='int64')
    return(sig_times)
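A single call on the first petition in the dataframe looks something like this (index 0 is purely illustrative):
# Grab the first batch of signature times for the first petition
first_batch = findSigs(petitions.id[0], "0")
first_batch[:5]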
Once we have a set of signatures, we would like to make it into time series data. The makeTS function will do this for us, and it serves as a helper function for the final, generalized function afterwards. I've explained the strategy on how to convert the UNIX times to time series data within the code, step-by-step.
def makeTS(data,pet_idx):
    """
    makeTS: Convert the set of all signature times into an hourly time series dataframe.
    INPUT:
        data(array): An integer array containing UNIX times of all petition signatures
        pet_idx(int): The index on the petitions data frame, used for determining the deadline.
    OUTPUT:
        ts_df(dataframe): A pandas dataframe containing the cumulative signatures, by-hour signatures, days left, and id variables.
    """
    #Find the number of time periods for the petition
    num_hours=math.ceil((np.max(data)-np.min(data))/3600)+2
    #Find the date of the first hour of the petition
    blank_bin=re.sub("[0-9]+:[0-9]+$","",convertUNIXtime(np.min(data)))+"00:00"
    #Generate the time intervals for each hour that the petition has been running
    unix_intervals=[convertDate(blank_bin)+i*3600 for i in range(1,num_hours)]
    #Create the days left variable by subtracting each hourly unix time from the deadline
    days_left=np.round((petitions.deadline[pet_idx]-np.array(unix_intervals))/(60*60*24),decimals=2)
    #Derive the cumulative signatures by hour, and difference them to obtain the hourly signature counts
    cumulative_sigs=[len(np.extract(data<=ui,data)) for ui in unix_intervals]
    sigs_diff=np.diff(np.append([0],cumulative_sigs))
    #Find the corresponding hours for the time periods and use them as an index
    hours=[convertUNIXtime(ui) for ui in unix_intervals]
    #Finally, make a data frame with all these variables
    ts_df=pd.DataFrame({"total":cumulative_sigs,"value":sigs_diff,"days_left":days_left,
                        "id":[pet_idx]*len(days_left)},index=hours)
    return(ts_df)
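Putting the two helpers together on a single petition gives a feel for the output; the index 0 here is just an example, and only the first batch of up to 1,000 signatures is used:
# Build an hourly time series from the first batch of signatures of petition 0
example_ts = makeTS(findSigs(petitions.id[0], "0"), 0)
example_ts.head()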
We can combine these two functions to create get_all_signatures, a function that builds a time series dataset for the signatures of every petition we have from the We The People API. To make sure I call the API the proper number of times without encountering errors, I set the number of queries to $\lceil \frac{n}{1000} \rceil$, where $n$ is the number of signatures on the petition.
def get_all_signatures(idx):
    #Get the required number of API calls for the petition
    num=math.ceil(petitions.signatureCount[idx]/1000)
    #Make all of the calls to the API
    sigtimes_list=[findSigs(petitions.id[idx],str(1000*i)) for i in range(num)]
    #Finally, concatenate the list of signature times into an array, and then convert it to a time series dataframe.
    signatures_ts=np.concatenate(sigtimes_list)
    ts_df=makeTS(signatures_ts,idx)
    return(ts_df)
We can finally perform the data extraction, simply by calling a list comprehension with the final function, and then using pd.concat to combine the list of dataframes into a single dataframe.
Warning: The operation to extract all of the data is very slow, and may take up to 30 minutes. It is meant to be completed once. I recommend reading in the csv from github instead.
# petitions_timeseries=pd.concat([get_all_signatures(i) for i in petitions.index])
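If the extraction has already been run, it is much faster to read the saved file back in; the path below is an assumption, so point it at wherever the csv actually lives:
# Read the pre-computed time series instead of re-querying the API
# (the path is an assumption; adjust to the csv location in the repo)
# petitions_timeseries = pd.read_csv("data/petitions_timeseries.csv", index_col=0)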
We also scraped the 20 most recent tweets from each of a set of political sources on Twitter, a site where many people get their political news. We manually compiled the URL and profile name of each source in the twitter_links.csv file.
We wanted to get a sampling of people from every position on the political spectrum, so our scraped profiles cover a variety of mainstream/fringe politicians and media figures. These included: Donald Trump, Betsy Devos, Kellyanne Conway, Mike Pence, Mitt Romney, Jeb Bush, Milo Yiannopoulos, Sarah Palin, Ted Cruz, Jerry Brown, Jill Stein, Barack Obama, Joe Biden, Bernie Sanders, Hillary Clinton, Robert Reich, Justin Trudeau, Nate Silver, NYT Politics, CNN Politics, FOX Politics, Post Politics, and We the People.
# LIBRARIES
import pandas as pd
from bs4 import BeautifulSoup
import requests
# IMPORT DATA
twitter = pd.read_csv("data/twitter_links.csv")
twitter.head(10)
# VIEW ALL OF THE SOURCES WE ARE SCRAPING FROM
print(twitter["profile"].values)
The following is the function we used to retrieve the data we desired.
# FUNCTION
def get_tweet_bag(twitter_url):
    # PARSE PROFILE
    this_request = requests.get(twitter_url).text
    abc_soup = BeautifulSoup(this_request, "html.parser")
    # GRAB DATA FOR 20 TWEETS
    twenty_tweet_data = abc_soup.find_all("div", {"class": "js-tweet-text-container"})
    # GET THE 20 TWEETS FOR ONE PERSON
    twenty_tweets = [x.find_all("p")[0].text for x in twenty_tweet_data]
    twenty_tweets = [x.encode("ascii", "replace").decode("ascii") for x in twenty_tweets]
    # CREATE BAG OF WORDS
    tweet_bag = " ".join(twenty_tweets)
    return(tweet_bag)
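Before looping over every source, it is worth smoke-testing the scraper on a single profile; the first row of the twitter dataframe is used here just as an example:
# Scrape one profile as a quick test
sample_bag = get_tweet_bag(twitter["url"].iloc[0])
sample_bag[:200]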
# LIST COMP TO RETRIEVE 20 TWEETS FROM EACH SOURCE
tweet_bags = [get_tweet_bag(x) for x in twitter["url"]]
This is the dataframe we constructed.
# ASSEMBLING DATAFRAME
twitter["tweet_bags"] = tweet_bags
# twitter.to_csv("data/twitter_data.csv")
twitter.head(10)