Word Cloud From Reddit Comments Gilded 10 Or More Times
In this notebook I'll create a word cloud visualization from words in reddit comments that were gilded at least 10 times, covering the period from the beginning of reddit to February 2019. The comment texts come from the reddit dataset created and maintained by Jason Baumgartner (/u/Stuck_In_the_Matrix). The data can be queried via a web service on Pushshift.io or via Felipe Hoffa's (/u/fhoffa) dataset made available on Google's BigQuery service. I used the latter option, running the following query, which processed 955 GB when run on 2019-04-28.
SELECT author, body, gilded FROM `fh-bigquery.reddit_comments.20*` WHERE gilded > 9
The query took only a few seconds, which is impressive considering the dataset comprises several billion records, but be aware that querying the whole reddit comment dataset can quickly become expensive.
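For reference, here is a minimal sketch of how such a query could be run from Python with the google-cloud-bigquery client. This is not the workflow I used for this notebook; the byte cap and output file name are illustrative assumptions, and Google Cloud credentials are expected to be configured in your environment.

# Minimal sketch using the google-cloud-bigquery client (illustrative, not the
# notebook's original workflow). Assumes credentials are configured, e.g. via
# GOOGLE_APPLICATION_CREDENTIALS or gcloud auth.
from google.cloud import bigquery

client = bigquery.Client()

# Guard against accidentally expensive queries; the 1 TB cap is an
# illustrative value, adjust it to your own budget.
job_config = bigquery.QueryJobConfig()
job_config.maximum_bytes_billed = 10**12

sql = '''
SELECT author, body, gilded
FROM `fh-bigquery.reddit_comments.20*`
WHERE gilded > 9
'''

df = client.query(sql, job_config=job_config).to_dataframe()
df.to_csv('reddit_comments_gilded_min_10.csv', index=False)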
In the code cell below, the Python libraries used in this notebook are imported, some variables are set, and the dataset is loaded into a pandas DataFrame.
Update
I added code to work around the issue that the spaCy language model en_core_web_lg (version 2.0.0) does not identify stop words.
%matplotlib inline
%load_ext signature

import os
import re

import pandas as pd
import matplotlib.pyplot as plt
import spacy

from collections import Counter
from imageio import imread
from wordcloud import WordCloud

font = '/usr/share/fonts/truetype/ubuntu-font-family/Ubuntu-B.ttf'
limit = 1000
mask = imread('img/reddit-alien-large-mask.png')
re_non_word = re.compile(r'[\W\s]')

nlp = spacy.load('en_core_web_lg')
# Workaround for issue with missing stopwords, see
# https://github.com/explosion/spaCy/issues/1574#issuecomment-346184948
for word in nlp.Defaults.stop_words:
    nlp.vocab[word].is_stop = True

df = pd.read_csv(os.path.expanduser('~/data/reddit/reddit_comments_gilded_min_10_20190428.csv'))
num_comments = len(df.body)
num_comments
523
In the next step the comment texts are tokenized using the spaCy library for natural language processing. Note that an English language model is used for all comments. Empty tokens, tokens recognized as stop words, and tokens that contain non-word characters, including whitespace, are ignored.
Tokens identified as adverbs, interjections, nouns or verbs are lemmatized, i.e. the dictionary form is used instead of the inflected form; for example, is and are are both represented as be. Adjectives and pronouns are lower-cased, except for I. I and proper names are upper-cased, so they can easily be distinguished from homographs with a different part of speech.
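As a quick illustration of the lemmatization described above, the snippet below (an illustrative check, not part of the original notebook) prints the part-of-speech tag and lemma spaCy assigns to each token of a sample sentence, using the nlp pipeline loaded earlier.

# Illustrative check of spaCy's lemmatization, using the pipeline loaded above.
for token in nlp('The mice were running'):
    print(token.text, token.pos_, token.lemma_)
# Expected output along the lines of: mice/NOUN -> mouse, were/VERB -> be,
# running/VERB -> run; the exact tags may vary with the model version.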
names = []
words = []

for idx, text in df.body.items():
    doc = nlp(text)
    for token in doc:
        # Skip stop words, empty tokens and tokens with non-word characters.
        if token.is_stop or not len(token.text) or re.search(re_non_word, token.text):
            continue
        if token.pos_ in ('ADV', 'INTJ', 'NOUN', 'VERB'):
            words.append(token.lemma_)
        elif token.pos_ in ('ADJ', 'PRON'):
            # Upper-case "I", lower-case all other adjectives and pronouns.
            if token.text in ('i', 'I'):
                words.append(token.text.upper())
            else:
                words.append(token.text.lower())
        elif token.pos_ == 'PROPN':
            # Collect proper names separately for the analysis further below.
            name = token.text.upper()
            names.append(name)
            words.append(name)
Next, we create a Counter object from the standard library's collections module, passing it the list of words to be considered for the word cloud. Then we create a WordCloud object, setting the mask parameter to a black image of Snoo, reddit's alien mascot, and call its fit_words method with the Counter object.
freq = Counter(words)

wc = WordCloud(
    max_words=limit,
    mask=mask,
    background_color='#ffffff',
    font_path=font).fit_words(freq)
Finally, we use matplotlib to display the word cloud image in this notebook and add some explanatory text and credits to it.
annotation = '''The {} most frequent words in {} reddit comments gilded 10 or more times until February 2019.
Adverbs, interjections, nouns and verbs were lemmatized. Proper names and "I" were upper-cased, all other words lower-cased.
Author: Ramiro Gómez - ramiro.org
Data: Jason Baumgartner & Felipe Hoffa - reddit.com'''.format(limit, num_comments)

fig = plt.figure()
fig.set_figwidth(18)
fig.set_figheight(24)

plt.imshow(wc, interpolation='bilinear')
plt.annotate(annotation, xy=(0, 60), fontsize=12)
plt.axis('off')
plt.savefig('img/reddit-gilded-comments-wordcloud.png', bbox_inches='tight')
If you search for data visualization experts' opinions on word clouds, you'll probably come to the conclusion that you shouldn't use them. There are certainly better ways to show frequencies, but I do think word clouds have an artistic appeal, and that's why I created this one.
That being said, if you are curious about which proper names occurred most often, check out the plain text output below. You can see that EDIT is frequently identified as a proper name. Redditors often use it to show that they edited the comment, to explain why, or just to say thanks for the gold.
Counter(names).most_common(10)
[('EDIT', 289), ('TRUMP', 275), ('REDDIT', 158), ('US', 98), ('AMERICA', 94), ('RUSSIA', 72), ('CLINTON', 71), ('PUTIN', 53), ('TPP', 49), ('ISRAEL', 49)]
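If you wanted to exclude such artifacts from the ranking, you could filter the names list before counting; the blacklist below is a hypothetical example, not something the original notebook does.

# Hypothetical filter: drop comment-editing artifacts such as EDIT before counting.
blacklist = {'EDIT'}
Counter(name for name in names if name not in blacklist).most_common(10)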
%signature
Author: Ramiro Gómez • Last edited: May 05, 2019 • Linux 5.0.0-13-generic - CPython 3.6.7 - IPython 7.5.0 - matplotlib 3.0.3 - numpy 1.14.3 - pandas 0.24.2
Published: May 03, 2019 by Ramiro Gómez.