Recall that in the Data Acquisition step we classified petitions as either conservative, liberal, or neutral based on our own judgment. Some bias on our part may be inherent in these labels, but we are confident that with these classifications we will be able to detect bias in petition language, as we set out to do.
We used a Random Forest model for our data, both because it is a very general approach and because it is straightforward to run and explain. We chose it over similar methods like KNN because its overall error rate was much lower. You can view the code where we implemented KNN here. A possible modification would be to use boosting instead, if we planned on running the model many times sequentially, since boosting uses fewer trees and thus might run faster (and the smaller models would also aid interpretability).
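For concreteness, here is a minimal sketch of that boosting variant using scikit-learn's GradientBoostingClassifier; the toy arrays below are made up and merely stand in for the real feature matrix and labels we build later in this section.
# SKETCH: SWAPPING BOOSTING IN FOR THE RANDOM FOREST (toy data only)
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
toy_features = np.random.randint(0, 3, size=(20, 10))    # fake word counts
toy_labels = np.array(["Conservative", "Liberal"] * 10)  # fake ideology labels
booster = GradientBoostingClassifier(n_estimators=100,   # boosting stages
                                     max_depth=3,        # shallow trees
                                     learning_rate=0.1)
booster = booster.fit(toy_features, toy_labels)
print(booster.predict(toy_features[:5]))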
Here are the libraries we used.
# LIBRARIES
# TEXT PROCESSING
import nltk
from nltk.corpus import stopwords
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
# DATA SCIENCE
import pandas as pd
import numpy as np
# VISUALIZATION
from wordcloud import WordCloud
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use("ggplot")
# ELSE
from collections import Counter
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
These functions clean up the petition body and title text.
# FUNCTION TO CLEAN UP BODY TEXT
def cleanUpText(text, additional_stopwords=[]):
    # REMOVE MARK UP
    new_text = text.replace("\r", "").replace("\n", "")
    # REMOVE URLS
    new_text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", new_text).strip()
    # REMOVE PUNCTUATION
    new_text = new_text.translate(None, string.punctuation)
    # REMOVE NUMBERS
    new_text = re.sub(r"\d+", "", new_text)
    # LOWERCASE
    new_text = new_text.lower()
    # SPLIT INTO WORDS
    new_text = new_text.split()
    # REMOVE STOPWORDS
    stops = stopwords.words("english") + additional_stopwords
    return [word for word in new_text if word not in stops]
# FUNCTION TO CLEAN UP TITLE TEXT
def cleanUpTitle(text):
    # REMOVE MARK UP
    new_text = text.replace("\r", "").replace("\n", "")
    # REMOVE NUMBERS
    new_text = re.sub(r"\d+", "", new_text)
    # LOWERCASE
    new_text = new_text.lower()
    return new_text
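For example, on a made-up snippet (any similar string works; `cleanUpText` assumes the NLTK stopwords corpus has already been fetched via `nltk.download("stopwords")`):
# QUICK CHECK OF THE CLEANERS ON A HYPOTHETICAL SNIPPET
sample = "We the People ask:\r\nVisit https://example.com 100 times!"
print(cleanUpTitle(sample))  # lowercased, digits stripped, punctuation kept
print(cleanUpText(sample))   # word list with URL, punctuation, digits, stopwords removed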
# SET WORKING DIRECTORY (os is imported with the libraries above)
os.chdir("/Users/EDIE/Box Sync/GitThings/project141b/")
# READ IN DATA TO TRAIN MODEL
petnlp = pd.read_csv("data/petnlp.csv", index_col=0)
We suspected that it would be beneficial to remove the neutral petitions from our model, since we’re really more interested in detecting bias than we are in predicting neutrality.
# REMOVE THE NEUTRALS
petnlp = petnlp[petnlp["ideology"]!="Neutral"]
petnlp.head()
cleaned_titles = [cleanUpTitle(x) for x in petnlp["title"]]
# DEFINE NEW STOP WORDS
new_stops = "President, president, people, without, needs, since, used, get, would, us, united, states, people, american, americans, national, government, petition, make, also, many, must, need, change, ask, use, every, trump, white, house, america, America, executive, Executive"
new_stops = new_stops.split(", ")
tokens = [cleanUpText(x, new_stops) for x in petnlp["body"]]
# JOIN TOKENS BACK INTO ONE ASCII STRING PER PETITION (Python 2 unicode handling)
blobs = [unicode(" ".join(x), errors="replace") for x in tokens]
blobs = [x.encode("ascii", "replace") for x in blobs]
blobs_df = pd.DataFrame({"title":cleaned_titles, "blobs":blobs, "ideology":petnlp["ideology"]})
# REMOVE ROWS WITH NULL IDEOLOGY LABELS
blobs_df = blobs_df[~pd.isnull(blobs_df["ideology"])]
blobs_df = blobs_df.reset_index(drop=True)
# COLLECT TRAINING DATA INDICES (len(blobs_df)*0.8 = 174.4, so take 174 rows)
import random
train = random.sample(range(0, len(blobs_df)), 174)
# print train
# train = [195, 214, 150, 177, 107, 112, 21, 202, 129, 199, 162, 63, 16, 160, 87, 45, 58, 97, 106, 136, 146, 41, 159, 131, 105, 120, 0, 13, 81, 194, 83, 141, 173, 175, 51, 86, 17, 180, 137, 89, 164, 126, 191, 65, 170, 185, 12, 33, 140, 67, 124, 114, 133, 125, 165, 187, 178, 113, 64, 181, 115, 138, 27, 77, 139, 35, 29, 102, 50, 88, 74, 7, 84, 26, 152, 184, 93, 20, 46, 28, 121, 62, 18, 171, 108, 22, 122, 149, 156, 47, 53, 98, 110, 8, 205, 85, 161, 166, 134, 66, 61, 101, 183, 24, 42, 211, 94, 23, 11, 55, 148, 44, 201, 135, 76, 91, 196, 70, 30, 143, 57, 151, 130, 72, 153, 116, 163, 100, 6, 75, 19, 14, 197, 109, 15, 4, 54, 43, 68, 90, 144, 128, 99, 142, 204, 1, 60, 10, 188, 207, 79, 25, 176, 34, 132, 158, 39, 31, 80, 37, 103, 123, 119, 3, 217, 154, 9, 96, 127, 172, 117, 59, 95, 179]
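Because `random.sample` draws a different split on every run (hence the hard-coded index list above), seeding the generator first is another way to make the split reproducible; a sketch, where the seed value is arbitrary:
# SKETCH: A REPRODUCIBLE 80/20 SPLIT (seed value is arbitrary)
random.seed(141)
train = random.sample(range(0, len(blobs_df)), int(len(blobs_df) * 0.8))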
# VECTORIZING TRAINING DATA
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,    # stopwords were already removed above
                             max_features=5000)  # keep the 5000 most frequent words
train_data_features = vectorizer.fit_transform(blobs_df["blobs"][train])
train_data_features = train_data_features.toarray()
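For intuition, here is what the vectorizer produces on a toy corpus; the two sentences are made up:
# TOY EXAMPLE OF CountVectorizer OUTPUT (sentences are made up)
demo = CountVectorizer(analyzer="word", max_features=5000)
demo_counts = demo.fit_transform(["ban the oil pipeline", "build the oil pipeline now"])
print(demo.get_feature_names())  # learned vocabulary, in alphabetical order
print(demo_counts.toarray())     # one row of word counts per document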
# GENERATE OUR FOREST (fit() trains the forest in place)
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, blobs_df["ideology"][train])
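With only 174 training rows, the forest's out-of-bag error is a nearly free sanity check; a sketch using the standard `oob_score` option of RandomForestClassifier:
# SKETCH: OUT-OF-BAG ERROR AS A SANITY CHECK ON THE FOREST
oob_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
oob_forest = oob_forest.fit(train_data_features, blobs_df["ideology"][train])
print(1 - oob_forest.oob_score_)  # OOB misclassification rate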
# GET INDICES OF TEST DATA
test = set(range(len(blobs_df))) - set(train)
test = list(test)
# VECTORIZING TEST DATA
test_data_features = vectorizer.transform(blobs_df["blobs"][test])
test_data_features = test_data_features.toarray()
# PREDICT
result = forest.predict(test_data_features)
pred_df = pd.DataFrame({"petition": blobs_df["title"][test],
                        "true_ideol": blobs_df["ideology"][test],
                        "pred_ideol": result})
pred_df = pred_df.reset_index(drop=True)
To test whether our suspicion about removing the neutrals was correct, we first calculated the misclassification rate looking only at 'conservative' and 'liberal' petitions, and got 27.27%. This is surprisingly good, given that our model is relatively simple and only looks at vocabulary choice.
# CALCULATE ERROR
pred_df["correct"] = (pred_df["pred_ideol"] == pred_df["true_ideol"])
# ERROR RATE: FRACTION OF MISCLASSIFIED PETITIONS
(pred_df["correct"] == False).mean()
Next, we classify the entire dataframe, neutral petitions included, using only the labels "conservative" or "liberal".
# NOW TAKE IN ALL DATA
full_petnlp = pd.read_csv("data/petnlp.csv", index_col=0)
# DEFINE NEW STOP WORDS
new_stops = "President, president, people, without, needs, since, used, get, would, us, united, states, people, american, americans, national, government, petition, make, also, many, must, need, change, ask, use, every, trump, white, house, america, America, executive, Executive"
new_stops = new_stops.split(", ")
# TOKENS
tokens = [cleanUpText(x, new_stops) for x in full_petnlp["body"]]
cleaned_titles = [cleanUpTitle(x) for x in full_petnlp["title"]]
blobs = [unicode(" ".join(x), errors="replace") for x in tokens]
blobs = [x.encode("ascii", "replace") for x in blobs]
blobs_df = pd.DataFrame({"title":cleaned_titles, "blobs":blobs, "ideology":full_petnlp["ideology"]})
# REMOVE ROWS WITH NULL IDEOLOGY LABELS
blobs_df = blobs_df[~pd.isnull(blobs_df["ideology"])]
blobs_df = blobs_df.reset_index(drop=True)
blobs_df.head()
new_text_data_features = vectorizer.transform(blobs_df["blobs"])
new_text_data_features = new_text_data_features.toarray()
new_text_data_features
# PREDICT NEW TEXT DATA
new_text_result = forest.predict(new_text_data_features)
# PREDICTED NEW TEXT DATA PROBABILITIES
new_text_predicted_probs = forest.predict_proba(new_text_data_features)
results_df = pd.DataFrame(new_text_predicted_probs)
# COLUMN ORDER FOLLOWS forest.classes_ (ALPHABETICAL)
results_df.columns = ["conservative", "liberal"]
results_df["prediction"] = results_df.idxmax(axis=1)
results_df["true"] = [x.lower() for x in blobs_df["ideology"]]
results_df.head()
As expected, the error rate is much higher when classifying the entire dataframe, because the model has no way to label a neutral document as neutral: every "truly neutral" petition necessarily counts as a misclassification against the listed "true ideology". On the other hand, this lets us see that the language of some "truly neutral" petitions actually sways more liberal than conservative, and vice versa.
Our error rate when classifying the entire dataframe as either liberal or conservative is 48.67%.
# CALCULATE ERROR
results_df["correct"] = (results_df["prediction"] == results_df["true"])
# ERROR RATE: FRACTION OF MISCLASSIFIED PETITIONS
(results_df["correct"] == False).mean()
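To surface the "truly neutral" petitions that sway one way or the other, one could sort the neutral rows by their predicted probabilities; a sketch, relying on `results_df` and `blobs_df` sharing row order:
# SKETCH: WHICH "TRULY NEUTRAL" PETITIONS SWAY LEFT OR RIGHT?
neutral = results_df[results_df["true"] == "neutral"].copy()
neutral["petition"] = blobs_df["title"][neutral.index]
print(neutral.sort_values("liberal", ascending=False).head())       # most liberal-sounding
print(neutral.sort_values("conservative", ascending=False).head())  # most conservative-sounding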