Recall that in the Data Acquisition step we classified petitions as either conservative, liberal, or neutral based on our own judgment. Some bias on our part may be inherent in these labels, but we are confident that with these classifications we will be able to detect bias in petition language, as we set out to do.
We used a Random Forest model for our data, both because it is a very general approach and because it is straightforward to run and explain. We chose it over similar methods like KNN because its overall error rate was much lower. You can view the code where we implemented KNN here. A possible modification would be to use boosting instead, if we planned on running the model many times sequentially, since boosting uses fewer trees and thus might run faster (and the smaller models would also aid interpretability).
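For concreteness, here is a minimal sketch of that boosting variant using scikit-learn's GradientBoostingClassifier; the toy arrays below are made up and merely stand in for the real feature matrix and labels we build later in this section.
# SKETCH: SWAPPING BOOSTING IN FOR THE RANDOM FOREST (toy data only)
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
toy_features = np.random.randint(0, 3, size=(20, 10))    # fake word counts
toy_labels = np.array(["Conservative", "Liberal"] * 10)  # fake ideology labels
booster = GradientBoostingClassifier(n_estimators=100,   # boosting stages
                                     max_depth=3,        # shallow trees
                                     learning_rate=0.1)
booster = booster.fit(toy_features, toy_labels)
print(booster.predict(toy_features[:5]))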
Here are the libraries we used.
# LIBRARIES
# TEXT PROCESSING
import nltk
from nltk.corpus import stopwords
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
# DATA SCIENCE
import pandas as pd
import numpy as np
# VISUALIZATION
from wordcloud import WordCloud
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use("ggplot")
# ELSE
from collections import Counter
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
These functions clean up the petition body and title text.
# FUNCTION TO CLEAN UP BODY TEXT
def cleanUpText(text, additional_stopwords=[]):
    # REMOVE MARK UP
    new_text = text.replace("\r", "").replace("\n", "")
    # REMOVE URLS
    new_text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", new_text).strip()
    # REMOVE PUNCTUATION
    new_text = new_text.translate(None, string.punctuation)
    # REMOVE NUMBERS
    new_text = re.sub(r"\d+", "", new_text)
    # LOWERCASE
    new_text = new_text.lower()
    # SPLIT INTO WORDS
    new_text = new_text.split()
    # REMOVE STOPWORDS
    stops = stopwords.words("english") + additional_stopwords
    return [word for word in new_text if word not in stops]
# FUNCTION TO CLEAN UP TITLE TEXT
def cleanUpTitle(text):
    # REMOVE MARK UP
    new_text = text.replace("\r", "").replace("\n", "")
    # REMOVE NUMBERS
    new_text = re.sub(r"\d+", "", new_text)
    # LOWERCASE
    new_text = new_text.lower()
    return new_text
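For example, on a made-up snippet (any similar string works; `cleanUpText` assumes the NLTK stopwords corpus has already been fetched via `nltk.download("stopwords")`):
# QUICK CHECK OF THE CLEANERS ON A HYPOTHETICAL SNIPPET
sample = "We the People ask:\r\nVisit https://example.com 100 times!"
print(cleanUpTitle(sample))  # lowercased, digits stripped, punctuation kept
print(cleanUpText(sample))   # word list with URL, punctuation, digits, stopwords removed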
# SET WORKING DIRECTORY (os is imported with the libraries above)
os.chdir("/Users/EDIE/Box Sync/GitThings/project141b/")
# READ IN DATA TO TRAIN MODEL
petnlp = pd.read_csv("data/petnlp.csv", index_col=0)
We suspected that it would be beneficial to remove the neutral petitions from our model, since we’re really more interested in detecting bias than we are in predicting neutrality.
# REMOVE THE NEUTRALS
petnlp = petnlp[petnlp["ideology"]!="Neutral"]
petnlp.head()
cleaned_titles = [cleanUpTitle(x) for x in petnlp["title"]]
# DEFINE NEW STOP WORDS
new_stops = "President, president, people, without, needs, since, used, get, would, us, united, states, people, american, americans, national, government, petition, make, also, many, must, need, change, ask, use, every, trump, white, house, america, America, executive, Executive"
new_stops = new_stops.split(", ")
tokens = [cleanUpText(x, new_stops) for x in petnlp["body"]]
# JOIN TOKENS BACK INTO ONE ASCII STRING PER PETITION (Python 2 unicode handling)
blobs = [unicode(" ".join(x), errors="replace") for x in tokens]
blobs = [x.encode("ascii", "replace") for x in blobs]
blobs_df = pd.DataFrame({"title":cleaned_titles, "blobs":blobs, "ideology":petnlp["ideology"]})
# REMOVE ROWS WITH NULL IDEOLOGY LABELS
blobs_df = blobs_df[~pd.isnull(blobs_df["ideology"])]
blobs_df = blobs_df.reset_index(drop=True)
# COLLECT TRAINING DATA INDICES (len(blobs_df)*0.8 = 174.4, so take 174 rows)
import random
train = random.sample(range(0, len(blobs_df)), 174)
# print train
# train = [195, 214, 150, 177, 107, 112, 21, 202, 129, 199, 162, 63, 16, 160, 87, 45, 58, 97, 106, 136, 146, 41, 159, 131, 105, 120, 0, 13, 81, 194, 83, 141, 173, 175, 51, 86, 17, 180, 137, 89, 164, 126, 191, 65, 170, 185, 12, 33, 140, 67, 124, 114, 133, 125, 165, 187, 178, 113, 64, 181, 115, 138, 27, 77, 139, 35, 29, 102, 50, 88, 74, 7, 84, 26, 152, 184, 93, 20, 46, 28, 121, 62, 18, 171, 108, 22, 122, 149, 156, 47, 53, 98, 110, 8, 205, 85, 161, 166, 134, 66, 61, 101, 183, 24, 42, 211, 94, 23, 11, 55, 148, 44, 201, 135, 76, 91, 196, 70, 30, 143, 57, 151, 130, 72, 153, 116, 163, 100, 6, 75, 19, 14, 197, 109, 15, 4, 54, 43, 68, 90, 144, 128, 99, 142, 204, 1, 60, 10, 188, 207, 79, 25, 176, 34, 132, 158, 39, 31, 80, 37, 103, 123, 119, 3, 217, 154, 9, 96, 127, 172, 117, 59, 95, 179]
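Because `random.sample` draws a different split on every run (hence the hard-coded index list above), seeding the generator first is another way to make the split reproducible; a sketch, where the seed value is arbitrary:
# SKETCH: A REPRODUCIBLE 80/20 SPLIT (seed value is arbitrary)
random.seed(141)
train = random.sample(range(0, len(blobs_df)), int(len(blobs_df) * 0.8))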
# VECTORIZING TRAINING DATA
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,    # stopwords were already removed above
                             max_features=5000)  # keep the 5000 most frequent words
train_data_features = vectorizer.fit_transform(blobs_df["blobs"][train])
train_data_features = train_data_features.toarray()
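For intuition, here is what the vectorizer produces on a toy corpus; the two sentences are made up:
# TOY EXAMPLE OF CountVectorizer OUTPUT (sentences are made up)
demo = CountVectorizer(analyzer="word", max_features=5000)
demo_counts = demo.fit_transform(["ban the oil pipeline", "build the oil pipeline now"])
print(demo.get_feature_names())  # learned vocabulary, in alphabetical order
print(demo_counts.toarray())     # one row of word counts per document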
# GENERATE OUR FOREST (fit() trains the forest in place)
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, blobs_df["ideology"][train])
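With only 174 training rows, the forest's out-of-bag error is a nearly free sanity check; a sketch using the standard `oob_score` option of RandomForestClassifier:
# SKETCH: OUT-OF-BAG ERROR AS A SANITY CHECK ON THE FOREST
oob_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
oob_forest = oob_forest.fit(train_data_features, blobs_df["ideology"][train])
print(1 - oob_forest.oob_score_)  # OOB misclassification rate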
# GET INDICES OF TEST DATA
test = set(range(len(blobs_df))) - set(train)
test = list(test)
# VECTORIZING TEST DATA
test_data_features = vectorizer.transform(blobs_df["blobs"][test])
test_data_features = test_data_features.toarray()
# PREDICT
result = forest.predict(test_data_features)
pred_df = pd.DataFrame({"petition": blobs_df["title"][test],
                        "true_ideol": blobs_df["ideology"][test],
                        "pred_ideol": result})
pred_df = pred_df.reset_index(drop=True)
To test whether our suspicion about removing the neutrals was correct, we first calculated the misclassification rate looking only at 'conservative' and 'liberal' petitions, and got 27.27%. This is surprisingly good, given that our model is relatively simple and only looks at vocabulary choice.
# CALCULATE ERROR
pred_df["correct"] = (pred_df["pred_ideol"] == pred_df["true_ideol"])
# ERROR RATE: FRACTION OF MISCLASSIFIED PETITIONS
(pred_df["correct"] == False).mean()
Next, we classify the entire dataframe, neutral petitions included, using only the labels "conservative" or "liberal".
# NOW TAKE IN ALL DATA
full_petnlp = pd.read_csv("data/petnlp.csv", index_col=0)
# DEFINE NEW STOP WORDS
new_stops = "President, president, people, without, needs, since, used, get, would, us, united, states, people, american, americans, national, government, petition, make, also, many, must, need, change, ask, use, every, trump, white, house, america, America, executive, Executive"
new_stops = new_stops.split(", ")
# TOKENS
tokens = [cleanUpText(x, new_stops) for x in full_petnlp["body"]]
cleaned_titles = [cleanUpTitle(x) for x in full_petnlp["title"]]
blobs = [unicode(" ".join(x), errors="replace") for x in tokens]
blobs = [x.encode("ascii", "replace") for x in blobs]
blobs_df = pd.DataFrame({"title":cleaned_titles, "blobs":blobs, "ideology":full_petnlp["ideology"]})
# REMOVE ROWS WITH NULL IDEOLOGY LABELS
blobs_df = blobs_df[~pd.isnull(blobs_df["ideology"])]
blobs_df = blobs_df.reset_index(drop=True)
blobs_df.head()
new_text_data_features = vectorizer.transform(blobs_df["blobs"])
new_text_data_features = new_text_data_features.toarray()
new_text_data_features
# PREDICT NEW TEXT DATA
new_text_result = forest.predict(new_text_data_features)
# PREDICTED NEW TEXT DATA PROBABILITIES
new_text_predicted_probs = forest.predict_proba(new_text_data_features)
results_df = pd.DataFrame(new_text_predicted_probs)
# COLUMN ORDER FOLLOWS forest.classes_ (ALPHABETICAL)
results_df.columns = ["conservative", "liberal"]
results_df["prediction"] = results_df.idxmax(axis=1)
results_df["true"] = [x.lower() for x in blobs_df["ideology"]]
results_df.head()
As expected, the error rate is much higher when classifying the entire dataframe, because the model has no way to label a neutral document as neutral: every "truly neutral" petition necessarily counts as a misclassification against the listed "true ideology". On the other hand, this lets us see that the language of some "truly neutral" petitions actually sways more liberal than conservative, and vice versa.
Our error rate when classifying the entire dataframe as either liberal or conservative is 48.67%.
# CALCULATE ERROR
results_df["correct"] = (results_df["prediction"] == results_df["true"])
# ERROR RATE: FRACTION OF MISCLASSIFIED PETITIONS
(results_df["correct"] == False).mean()
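To surface the "truly neutral" petitions that sway one way or the other, one could sort the neutral rows by their predicted probabilities; a sketch, relying on `results_df` and `blobs_df` sharing row order:
# SKETCH: WHICH "TRULY NEUTRAL" PETITIONS SWAY LEFT OR RIGHT?
neutral = results_df[results_df["true"] == "neutral"].copy()
neutral["petition"] = blobs_df["title"][neutral.index]
print(neutral.sort_values("liberal", ascending=False).head())       # most liberal-sounding
print(neutral.sort_values("conservative", ascending=False).head())  # most conservative-sounding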