
Word Cloud From Reddit Comments Gilded 10 Or More Times

In this notebook I'll create a word cloud visualization from words in reddit comments that were gilded at least 10 times from the beginning of reddit to February 2019. The comment texts come from the reddit dataset created and maintained by Jason Baumgartner (/u/Stuck_In_the_Matrix). The data can be queried via a web service on Pushshift.io or via Felipe Hoffa's (/u/fhoffa) dataset made available on Google's BigQuery service. I used the latter option, running the following query, which processed 955 GB when run on 2019-04-28.

SELECT author, body, gilded
FROM `fh-bigquery.reddit_comments.20*`
WHERE gilded > 9

It took only a few seconds, which is impressive considering the dataset comprises several billion records, but be aware that querying the whole reddit comment dataset can quickly become expensive.
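For completeness, here is a minimal sketch of how the query result could be exported to the CSV file that is read further down, using the google-cloud-bigquery client. This is not part of the original notebook; it assumes the library is installed, credentials are configured, and that 'my-project' is replaced with your own Google Cloud project ID.

import os

from google.cloud import bigquery

sql = '''
SELECT author, body, gilded
FROM `fh-bigquery.reddit_comments.20*`
WHERE gilded > 9
'''

# Run the query and fetch the result as a pandas DataFrame.
client = bigquery.Client(project='my-project')  # hypothetical project ID
result = client.query(sql).to_dataframe()

# Save it to the file that is loaded into a DataFrame below.
result.to_csv(
    os.path.expanduser('~/data/reddit/reddit_comments_gilded_min_10_20190428.csv'),
    index=False)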

In the code cell below, the Python libraries used in this notebook are imported, some variables are set, and the dataset is loaded into a pandas DataFrame.

Update

I added code to work around the issue that the spaCy language model en_core_web_lg (version 2.0.0) does not identify stopwords.

%matplotlib inline
%load_ext signature
import os
import re

import pandas as pd
import matplotlib.pyplot as plt
import spacy

from collections import Counter

from imageio import imread
from wordcloud import WordCloud


font = '/usr/share/fonts/truetype/ubuntu-font-family/Ubuntu-B.ttf'
limit = 1000
mask = imread('img/reddit-alien-large-mask.png')
re_non_word = re.compile(r'[\W\s]')

nlp = spacy.load('en_core_web_lg')

# Workaround for issue with missing stopwords, see
# https://github.com/explosion/spaCy/issues/1574#issuecomment-346184948
for word in nlp.Defaults.stop_words:
    nlp.vocab[word].is_stop = True

df = pd.read_csv(os.path.expanduser('~/data/reddit/reddit_comments_gilded_min_10_20190428.csv'))
num_comments = len(df.body)
num_comments
523

In the next step the comment texts are tokenized using the spaCy library for natural language processing. Note that an English language model is used for all comments. Empty tokens, tokens recognized as stop words, and tokens that contain non-word characters or whitespace are ignored.

Tokens identified as adverbs, interjections, nouns or verbs are lemmatized, i.e. the dictionary form is used instead of the inflected form. For example, is and are are both represented by the base form be. Adjectives and pronouns are lower-cased, except for I, and proper names are upper-cased, so they can easily be distinguished from homographs with a different part of speech.
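To see what spaCy provides for each token, here is a quick illustrative snippet that is not part of the original analysis. It reuses the nlp model loaded above on a made-up sentence and prints each token's text, lemma, coarse part-of-speech tag and stop-word flag, the attributes the loop below relies on.

# Illustrative only: inspect per-token attributes on a sample sentence.
doc = nlp('The cats are running, and I am amazed.')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)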

names = []
words = []

for idx, text in df.body.items():
    doc = nlp(text)
    for token in doc:
        if token.is_stop or not len(token.text) or re.search(re_non_word, token.text):
            continue
        if token.pos_ in ('ADV', 'INTJ', 'NOUN', 'VERB'):
            words.append(token.lemma_)
        elif token.pos_ in ('ADJ', 'PRON'):
            if token.text in ('i', 'I'):
                words.append(token.text.upper())
            else:
                words.append(token.text.lower())
        elif token.pos_ == 'PROPN':
            name = token.text.upper()
            names.append(name)
            words.append(name)

Next we create a Counter object from the standard library's collections module, passing it the list of words to be considered for the word cloud. Then we create a WordCloud object, setting the mask parameter to a black image of Snoo, reddit's alien mascot, and call its fit_words method with the Counter object.

freq = Counter(words)
wc = WordCloud(
    max_words=limit,
    mask=mask,
    background_color='#ffffff',
    font_path=font).fit_words(freq)
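If you want to sanity-check what goes into the word cloud before rendering it, the Counter object can be inspected directly. This is an optional check, not part of the original notebook:

# Show the ten most frequent words and their counts.
freq.most_common(10)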

Finally, we use matplotlib to display the word cloud image in this notebook and add some explanatory text and credits to it.

annotation = '''The {} most frequent words in {} reddit comments gilded 10 or more times
until February 2019. Adverbs, interjections, nouns and verbs were lemmatized.
Proper names and "I" were upper-cased, all other words lower-cased.
Author: Ramiro Gómez - ramiro.org
Data: Jason Baumgartner & Felipe Hoffa - reddit.com'''.format(limit, num_comments)

fig = plt.figure()
fig.set_figwidth(18)
fig.set_figheight(24)

plt.imshow(wc, interpolation='bilinear')
plt.annotate(annotation, xy=(0, 60), fontsize=12)
plt.axis('off')
plt.savefig('img/reddit-gilded-comments-wordcloud.png', bbox_inches='tight')

If you search for data visualization experts' opinions on word clouds, you'll probably come to the conclusion not to use them. There are certainly better ways to show frequencies, but I do think word clouds have an artistic appeal, and that's why I created this one.

That being said, if you are curious about which proper names occurred most often, check out the plain text output below. You can see that EDIT is frequently identified as a proper name. It is often used by redditors to show that they edited the comment, to explain why, or just to say thanks for the gold.

Counter(names).most_common(10)
[('EDIT', 289),
 ('TRUMP', 275),
 ('REDDIT', 158),
 ('US', 98),
 ('AMERICA', 94),
 ('RUSSIA', 72),
 ('CLINTON', 71),
 ('PUTIN', 53),
 ('TPP', 49),
 ('ISRAEL', 49)]
%signature

Author: Ramiro Gómez • Last edited: May 05, 2019 • Linux 5.0.0-13-generic - CPython 3.6.7 - IPython 7.5.0 - matplotlib 3.0.3 - numpy 1.14.3 - pandas 0.24.2
