As a preliminary analysis, we'll make some simple histograms of the most common words used in petitions from each political ideology.
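# IMPORTS (skip if already loaded earlier in the notebook)
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# LIBERAL WORDS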
# Join all Liberal petition texts into one string and count the words
liberal_text2 = " ".join(cleaned_text_df[cleaned_text_df["Ideology"] == "Liberal"]["New Text"])
liberal_counts2 = Counter(liberal_text2.split())
liberal_counts2.most_common(20)
# Build a word/count table sorted by count (column 0 holds the counts)
df = pd.DataFrame.from_dict(liberal_counts2, orient='index').reset_index()
df = df.sort_values(by=0, ascending=False)
# Select the top 20 words and their counts by column position
word = list(df.iloc[:, 0])[0:20]
count = list(df.iloc[:, 1])[0:20]
word_indices = np.arange(len(word)) # ORDERED INDICES
# PLOTTING
plt.figure(figsize=(20,10))
plt.bar(word_indices, count, color="royalblue")
plt.xticks(word_indices, word, rotation=70)
plt.ylabel("Absolute Frequency")
plt.title("Absolute Frequency of Words in Liberal Petitions")
plt.show()
# CONSERVATIVE WORDS
# Join all Conservative petition texts into one string and count the words
conservative_text2 = " ".join(cleaned_text_df[cleaned_text_df["Ideology"] == "Conservative"]["New Text"])
conservative_counts2 = Counter(conservative_text2.split())
conservative_counts2.most_common(20)
# Build a word/count table sorted by count (column 0 holds the counts)
df = pd.DataFrame.from_dict(conservative_counts2, orient='index').reset_index()
df = df.sort_values(by=0, ascending=False)
# Select the top 20 words and their counts by column position
word = list(df.iloc[:, 0])[0:20]
count = list(df.iloc[:, 1])[0:20]
word_indices = np.arange(len(word)) # ORDERED INDICES
# PLOTTING
plt.figure(figsize=(20,10))
plt.bar(word_indices, count, color="darkred")
plt.xticks(word_indices, word, rotation=70)
plt.ylabel("Absolute Frequency")
plt.title("Absolute Frequency of Words in Conservative Petitions")
plt.show()
From a cursory examination of the data, there doesn't appear to be anything especially 'liberal' about the Liberal words or 'conservative' about the Conservative ones, with the exception of a couple of standouts ('guns' and 'climate'). What's unusual is that most of these are fairly common words, and yet there is little overlap between opposite ends of the political spectrum. This leads us to think that there's likely a lot of value to be had in creating an ideology classifier, since there might be a lot of signals that are obvious to a computer but simply go over the heads of humans.
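The "little overlap" observation above comes from eyeballing the two charts. One way to check it numerically is to intersect the two top-word lists. The snippet below is a minimal sketch, not part of the original analysis: the helper names `top_words` and `top_word_overlap` are hypothetical, and the toy petition texts are made up purely to keep the example self-contained.

```python
from collections import Counter


def top_words(texts, n=20):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return [word for word, _ in counts.most_common(n)]


def top_word_overlap(texts_a, texts_b, n=20):
    """Return the shared top-n words and the Jaccard similarity of the two top-n sets."""
    top_a = set(top_words(texts_a, n))
    top_b = set(top_words(texts_b, n))
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Toy stand-in data (hypothetical); with the real data one would pass e.g.
# cleaned_text_df.loc[cleaned_text_df["Ideology"] == "Liberal", "New Text"]
liberal_texts = ["climate change action now", "protect climate and health care"]
conservative_texts = ["protect gun rights now", "gun owners rights"]

shared, jaccard = top_word_overlap(liberal_texts, conservative_texts, n=20)
print(sorted(shared), round(jaccard, 3))
```

A Jaccard value near 0 would support the claim that the two groups' most common words barely overlap, while a value near 1 would contradict it. Note that ties in word counts can change which words land in the top `n`, so the result may shift slightly with the choice of `n`.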