This project, although hosted on my site, was a collaborative project between myself and two outstanding others.
See Patrick Vacek's profile and Graham Smith's profile for data science brilliance.

View Project Page | Previous Step: Exploring Language of White House Petitions | Next Step: Applying Algorithm on Political Tweets
Fitting Random Forest to Vocabulary

Motivation

Recall that in the Data Acquisition step we knowledgeably classified petitions as either conservative, liberal, or neutral. Some bias on our end may have been inherent in this classification, but we are confident that with these classifications will be able to detect petition language bias as we desire.

We used a Random Forest model for our data, both because it's a very general approach, and because they are straightforward to run and explain. We chose it over similar methods like KNN because overall the error rate was much lower. You can view the code where we implemented KNN here. A possible modification of this model might be to use boosting instead, if we planned on running it many times sequentially since that method uses fewer trees and thus might theoretically run faster (along with aiding in interpretability since the models will be smaller).

Contents

  1. Libraries
  2. Functions
  3. Removing Neutrality
  4. Train the Model
  5. Classify Entire Dataframe
  6. Testing External Data

Libraries

Here are the libraries we used.

In [1]:
# LIBRARIES
# TEXT PROCESSING
import nltk
from nltk.corpus import stopwords
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# DATA SCIENCE
import pandas as pd
import numpy as np

# VISUALIZATION
from wordcloud import WordCloud
from matplotlib import pyplot as plt
get_ipython().magic(u'matplotlib inline')
plt.style.use("ggplot")

# ELSE
from collections import Counter
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

Functions

These functions are to clean up the text.

In [2]:
# FUNCTION TO CLEAN UP BODY TEXT
def cleanUpText(text, additional_stopwords=[]):
    # REMOVE MARK UP
    new_text = text.replace("\r", "").replace("\n", "")
    # REMOVE URLS
    new_text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", new_text).strip()
    # REMOVE PUNCTUATION
    new_text = new_text.translate(None, string.punctuation)
    # REMOVE NUMBERS
    new_text = re.sub(r"\d+", "", new_text)
    # LOWERCASE
    new_text = new_text.lower()
    #SPLIT
    new_text = new_text.split()
    # REMOVE STOPWORDS
    stops = stopwords.words("english") + additional_stopwords
    return [word for word in new_text if word not in stops]



# FUNCTION TO CLEAN UP BODY TEXT
def cleanUpTitle(text):
    # REMOVE MARK UP
    new_text = text.replace("\r", "").replace("\n", "")
    # REMOVE NUMBERS
    new_text = re.sub(r"\d+", "", new_text)
    # LOWERCASE
    new_text = new_text.lower()
    return(new_text)
In [3]:
import os
os.chdir("/Users/EDIE/Box Sync/GitThings/project141b/")
In [4]:
# READ IN DATA TO TRAIN MODEL
petnlp = pd.DataFrame.from_csv("data/petnlp.csv")

Removing Neutrality

We suspected that it would be beneficial to remove the neutral petitions from our model, since we’re really more interested in detecting bias than we are in predicting neutrality.

In [7]:
# REMOVE THE NEUTRALS
petnlp = petnlp[petnlp["ideology"]!="Neutral"]
In [8]:
petnlp.head()
Out[8]:
body issues petition_type title url ideology
0 It effects every American in some way. It wil... Budget & Taxes, Economy & Jobs, Veterans & Mil... Change an existing Administration policy Legalize Marijuana and bring jobs to millions ... https://petitions.whitehouse.gov/petition/lega... Liberal
3 We cannot make America great with so many disa... Economy & Jobs, Health Care, Technology & Inno... Propose a new Administration policy Take Action to End the Autism Epidemic and Imp... https://petitions.whitehouse.gov/petition/take... Conservative
5 The DEA has erroneously classified marijuana a... Civil Rights & Equality, Criminal Justice Refo... Change an existing Administration policy Make marijuana legalization for recreation or ... https://petitions.whitehouse.gov/petition/make... Liberal
7 Approximately 1 in 5 youth aged 13x18 suffer f... Health Care Call on Congress to act on an issue Take Action to Improve Screening for Mental Il... https://petitions.whitehouse.gov/petition/take... Liberal
9 In the long history of the White House with mu... Civil Rights & Equality, Government & Regulato... Take or explain a position on an issue or policy We strongly protest exclusion of news orgs. fr... https://petitions.whitehouse.gov/petition/we-s... Liberal

Train the Model

Let's train a Random Forest to the petition text.

Subsetting the Training Data

In [10]:
cleaned_titles = [cleanUpTitle(x) for x in petnlp["title"]]


# DEFINE NEW STOP WORDS
new_stops = "President, president, people, without, needs, since, used, get, would, us, united, states, people, american, americans, national, government, petition, make, also, many, must, need, change, ask, use, every, trump, white, house, america, America, executive, Executive"
new_stops = new_stops.split(", ")


tokens = [cleanUpText(x, new_stops) for x in petnlp["body"]]

blobs = [unicode(" ".join(x), errors="replace") for x in tokens]
blobs = [x.encode("ascii", "replace") for x in blobs]
blobs_df = pd.DataFrame({"title":cleaned_titles, "blobs":blobs, "ideology":petnlp["ideology"]})


# REMOVE THE NULLS
index = blobs_df["ideology"].index[blobs_df["ideology"].apply(pd.isnull)]
blobs_df = blobs_df[~pd.isnull(blobs_df["ideology"])]
blobs_df = blobs_df.reset_index(drop=True)
/Users/EDIE/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:17: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
In [11]:
# COLLECT TRAINING DATA INDICES
import random

# len(blobs_df)*0.8 = 174.4
pseudo_rando_nums = random.sample(range(0, len(blobs_df)-1), 174)

train = pseudo_rando_nums
# print train
# train = [195, 214, 150, 177, 107, 112, 21, 202, 129, 199, 162, 63, 16, 160, 87, 45, 58, 97, 106, 136, 146, 41, 159, 131, 105, 120, 0, 13, 81, 194, 83, 141, 173, 175, 51, 86, 17, 180, 137, 89, 164, 126, 191, 65, 170, 185, 12, 33, 140, 67, 124, 114, 133, 125, 165, 187, 178, 113, 64, 181, 115, 138, 27, 77, 139, 35, 29, 102, 50, 88, 74, 7, 84, 26, 152, 184, 93, 20, 46, 28, 121, 62, 18, 171, 108, 22, 122, 149, 156, 47, 53, 98, 110, 8, 205, 85, 161, 166, 134, 66, 61, 101, 183, 24, 42, 211, 94, 23, 11, 55, 148, 44, 201, 135, 76, 91, 196, 70, 30, 143, 57, 151, 130, 72, 153, 116, 163, 100, 6, 75, 19, 14, 197, 109, 15, 4, 54, 43, 68, 90, 144, 128, 99, 142, 204, 1, 60, 10, 188, 207, 79, 25, 176, 34, 132, 158, 39, 31, 80, 37, 103, 123, 119, 3, 217, 154, 9, 96, 127, 172, 117, 59, 95, 179]
In [12]:
# VECTORIZING TRAINING DATA
vectorizer = CountVectorizer(analyzer= "word",
                            tokenizer = None,
                            preprocessor = None,
                            stop_words = None,
                            max_features = 5000)
train_data_features = vectorizer.fit_transform(blobs_df["blobs"][train])
train_data_features = train_data_features.toarray()


# GENERATE OUR FOREST
forest = RandomForestClassifier(n_estimators = 100)
our_forest = forest.fit(train_data_features, blobs_df["ideology"][train])

Subsetting the Test Data

In [13]:
# GET INDICES OF TEST DATA
test = set(range(len(blobs_df))) - set(train)
test = list(test)

# VECTORIZING TEST DATA
test_data_features = vectorizer.transform(blobs_df["blobs"][test])
test_data_features = test_data_features.toarray()


# PREDICT
result = forest.predict(test_data_features)
pred_df = pd.DataFrame({"petition":blobs_df["title"][test], "true_ideol":blobs_df["ideology"][test], "pred_ideol":result})
pred_df = pred_df.reindex()

Calculating Error Rate

To test if our suspicions about removing neutrality were correct, we first calculated the misclassification rate only looking at ‘conservative’ or ‘liberal’ petitions and got 27.27%. This is surprisingly good, given that our model is relatively simple and only looks at vocabulary choice.

In [14]:
# CALCULATE ERROR
pred_df["correct"] = (pred_df["pred_ideol"] == pred_df["true_ideol"])

# ERROR RATE
len(pred_df[pred_df["correct"]==False])*(len(pred_df)**(-1))
Out[14]:
0.2727272727272727

Classify Entire Dataframe

When we classify the entire dataframe by only the labels of "conservative" or "liberal", we get a 48.67% error rate.

In [15]:
# NOW TAKE IN ALL DATA
full_petnlp = pd.DataFrame.from_csv("data/petnlp.csv")
In [16]:
# DEFINE NEW STOP WORDS
new_stops = "President, president, people, without, needs, since, used, get, would, us, united, states, people, american, americans, national, government, petition, make, also, many, must, need, change, ask, use, every, trump, white, house, america, America, executive, Executive"
new_stops = new_stops.split(", ")


# TOKENS
tokens = [cleanUpText(x, new_stops) for x in full_petnlp["body"]]
cleaned_titles = [cleanUpTitle(x) for x in full_petnlp["title"]]

blobs = [unicode(" ".join(x), errors="replace") for x in tokens]
blobs = [x.encode("ascii", "replace") for x in blobs]
blobs_df = pd.DataFrame({"title":cleaned_titles, "blobs":blobs, "ideology":full_petnlp["ideology"]})


# REMOVE THE NULLS
index = blobs_df["ideology"].index[blobs_df["ideology"].apply(pd.isnull)]
blobs_df = blobs_df[~pd.isnull(blobs_df["ideology"])]
blobs_df = blobs_df.reset_index(drop=True)
/Users/EDIE/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:17: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
In [17]:
blobs_df.head()
Out[17]:
blobs ideology title
0 effects way keep funding terrorism drug cartel... Liberal legalize marijuana and bring jobs to millions ...
1 dying days ottoman empire cover world war youn... Neutral officially recognize the armenian genocide of
2 walleye political pawns mille lacs lakewhile n... Neutral eliminate mille lacs lake treaty management
3 cannot great disabled autistic children urgent... Conservative take action to end the autism epidemic and imp...
4 locked years prison exercising constitutional ... Neutral release kevin trudeau
In [18]:
new_text__data_features = vectorizer.transform(blobs_df["blobs"])
new_text__data_features = new_text__data_features.toarray()
new_text__data_features


# PREDICT NEW TEXT DATA
new_text_result = forest.predict(new_text__data_features)

# PREDICTED NEW TEXT DATA PROBABILITIES
new_text_predicted_probs = forest.predict_proba(new_text__data_features)
results_df = pd.DataFrame(new_text_predicted_probs)
results_df.columns = ["conserative", "liberal"]
results_df["prediction"] = results_df.idxmax(axis=1)
results_df["true"] = [x.lower() for x in blobs_df["ideology"]]
results_df.head()
Out[18]:
conserative liberal prediction true
0 0.13 0.87 liberal liberal
1 0.20 0.80 liberal neutral
2 0.37 0.63 liberal neutral
3 0.65 0.35 conserative conservative
4 0.40 0.60 liberal neutral

Calculating Error Rate

As expected, the error rate is higher for classifying the entire dataframe because we cannot classify neutral documents as neutral. We realize this as an issue for when we test our model against the listed "true ideology". However, we can easily see that some language choices of "truly neutral" ideology actually sway more liberal than conservative and vice versa.

Our error rate when classifying the entire dataframe as either liberal or conservative is 48.67%.

In [19]:
# CALCULATE ERROR
results_df["correct"] = (results_df["prediction"] == results_df["true"])
In [20]:
# ERROR RATE
len(results_df[results_df["correct"]==False])*(len(results_df)**(-1))
Out[20]:
0.4866920152091255


View Project Page | Previous Step: Exploring Language of White House Petitions | Next Step: Applying Algorithm on Political Tweets