A basic word cloud is typically the starting point for text analysis. A word cloud is a simple visualization, generated by tools such as Voyant, Paper Machines, and Wordle, in which one can view the most frequently used words in a text or group of texts. The size of each word in the cloud is directly proportional to the number of times it is used in a given corpus: the larger the word in the illustration, the more frequently it appears in the texts from which the cloud was generated.
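To make the proportionality concrete, here is a minimal sketch in Python. It assumes a simple regex tokenizer, a tiny hypothetical stop-word list, and linear scaling against the most frequent word; Voyant, Paper Machines, and Wordle may tokenize and scale differently.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "at"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def font_sizes(freqs, max_size=72):
    """Scale each word's font size in direct proportion to its count."""
    if not freqs:
        return {}
    top = max(freqs.values())
    return {word: max_size * count / top for word, count in freqs.items()}

# Invented sample text, not taken from the corpus discussed in this post.
sample = "The bomb went off at the finish line. Smoke covered the finish line."
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)
```

In this sketch, a word used twice as often as another is drawn at twice the size, which is all that "directly proportional" means here.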
There are many types of word clouds you can create (I personally prefer the more colorful, dense clouds created using Paper Machines), and each looks pretty good. You can easily whip up a word cloud for any given text and post it as a nice image to accompany your writing. But this is where many tend to criticize word clouds. Sure, they look great, but just how valuable are word clouds for analysis? What can you do with a word cloud that contributes to your argument, rather than leaving it as a cool-looking but methodologically worthless accompanying image?
One of the main criticisms of word clouds is that you lose all context surrounding the words in your corpus. Sure, we can see whether a document uses a word, and how many times the word is used, but word clouds cannot tell us how a word is used. One can tell that a word is used frequently, but one cannot tell whether the speaker or author is referring to it positively or negatively. For example, an economic history might frequently reference Marx or Marxist critiques, but that does not mean that the history is a Marxist economic history. It might be the opposite. Every mention of “Marx” might be accompanied by a criticism, but in a word cloud we cannot see that criticism. The problem with a basic word cloud is that you cannot know for sure why a word is used so frequently without looking into the context.
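One common way to recover that missing context is a keyword-in-context listing, which shows each occurrence of a word with a few words on either side. The following is only a rough sketch: the function name, the window size, and the sample sentence are all invented for illustration and are not part of any tool mentioned in this post.

```python
import re

def keyword_in_context(text, keyword, window=5):
    """Return each occurrence of keyword with up to `window` words on either side."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [token] + right))
    return hits

# Invented sample sentence, not drawn from any real economic history.
sample = "The author rejects Marx on wages, yet cites Marx approvingly on crises."
contexts = keyword_in_context(sample, "Marx", window=2)
```

A word cloud would count both mentions of “Marx” identically; the listing shows that one is a rejection and the other an endorsement.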
But this does not mean word clouds are worthless. If one considers word clouds carefully, paying attention to both word frequencies and contextual data, one can draw some interesting conclusions from them. Particularly when word clouds are paired with some of the other data that Voyant and Paper Machines provide, we can use them as an effective starting point for deeper text investigation. Moreover, one can use word clouds of two similar corpora (such as the Globe Stories and Public Submissions) as a comparative tool to spark further inquiry.
Consider the following word cloud, for example. This cloud was generated with Paper Machines, using both the Globe Stories and the Public Submissions, removing the most common English stop words. It gives you an idea of the words used in the entire corpus:
At first glance, you might notice quite a few obvious words. These stories are all about experiences of the Boston Marathon bombing, so many of the most frequently used words (e.g. Boston, marathon, people, finish, line, runners, running, Boylston, police, bombs) are exactly what we would expect from a word cloud generated from this corpus. However, on closer inspection, and paired with a comparative analysis of word clouds generated from the Globe Stories and Public Submissions separately, one can begin to investigate some interesting characteristics. I want to first draw your attention to the various ways (and frequencies) that people described the bombing itself.
In this word cloud, I have circled the various terms people used to describe and categorize the bombings: bomb, bombs, blast, explosion, explosions, smoke, and bombing. In any given corpus of stories, some variety in the words people use to describe the explosions is expected (in fact, in the past couple of paragraphs, I have used quite a few different words myself!). However, when one splits the word cloud of all stories into individual word clouds for the Globe Stories and the Public Submissions, one can see that the variety is actually corpus-specific:
At first glance, these two word clouds might look very similar. But if you take a look at the relative uses of words such as bomb and explosion, you will notice that most of these words are used much more frequently in the Globe Stories than in the Public Submissions. Words such as “bomb,” “bombs,” and “explosion” are used much less in the Public Submissions than in the Globe Stories. However, one word jumps out in the Public Submissions compared to the Globe Stories: “bombing.” It would appear that the Public Submissions refer to the explosions themselves less than the Globe Stories do, and when they do mention the explosions, they tend to describe them as a “bombing.” But this is as far as we can explore using only the word clouds. Any further investigation requires a deeper look into the trends and discrepancies in word frequencies between the two corpora.
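One way to take that deeper look programmatically is to count a fixed list of terms in each sub-corpus separately. The sketch below is an assumption-laden illustration, not the method used for this post: the tokenizer is a simple regex, the function name is hypothetical, and the two sample texts are invented placeholders rather than real stories.

```python
import re
from collections import Counter

# Terms highlighted in the word cloud above.
TERMS = ["bomb", "bombs", "blast", "explosion", "explosions", "smoke", "bombing"]

def count_terms(documents, terms=TERMS):
    """Count exact whole-word matches of each term across a list of texts."""
    wanted = set(terms)
    counts = Counter({term: 0 for term in terms})
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return counts

# Invented placeholder texts standing in for the two sub-corpora.
globe_stories = ["The blast shook Boylston. Smoke rose after the second explosion."]
public_submissions = ["I was at work when I heard about the bombing."]

globe_counts = count_terms(globe_stories)
public_counts = count_terms(public_submissions)
```

Because matching is on whole tokens, “bomb” and “bombing” are counted separately, which matters for the comparison that follows.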
Using Voyant, and double-checking the results with a manual search of the text files, I compiled a list of words used to describe the bombing: “explosion,” “explosions,” “bomb,” “bombs,” “bombing,” “bombings,” “bombers,” “smoke,” “blast,” and “blasts.” I then calculated the overall frequency of each word’s use in the entire collection of stories. Below is a spreadsheet depicting the data:
You will notice that words such as bomb (184), explosion (167), blast (134), smoke (130), explosions (87), and bombs (78) appear frequently in the entire corpus compared to terms like bombing (39), bombings (17), and bombers (7). If one were to stop here, one might think that these frequencies simply indicate a preference for certain words to describe the explosions, and that this preference is characteristic of the entire corpus (both Globe Stories and Public Submissions). But if you consider the Globe Stories and Public Submissions separately, using the same words for comparison, a distinct difference between the two sub-corpora emerges. First, we will consider the Globe Stories:
You may first notice that there is an additional column in this spreadsheet of the Globe Stories. Here I wanted not only the raw frequency of each word, but also a sense of what percentage of all uses of the word in the entire corpus came from the Globe Stories (“Percentage of Total Usage”). For all but three of the selected terms (bombing, bombings, bombers), a significant majority (~70% or more) of the occurrences were in the Globe Stories. For example, 147 of the 167 uses (~88.02%) of the word “explosion” came from the Globe Stories. One can observe a similar trend in the uses of “explosions,” “bomb,” “bombs,” “smoke,” “blast,” and “blasts.” The Public Submissions, on the other hand, reveal the reverse of this trend:
These terms are not only used less frequently in the Public Submissions than in the Globe Stories; the Public Submissions are also much more likely than the rest of the corpus to use the words “bombing” and “bombings” to describe the attacks. This is not to say that the Public Submissions do not use the other words. For example, “bomb” is used just as frequently in the Public Submissions as “bombing.” However, the thirty-two instances of “bombing” account for 82.05% of the total uses of that word in all stories, whereas the instances of “bomb” account for only 17.39% of that word’s total use, which makes the difference in usage much more significant than the raw counts suggest. This difference raises questions about how people describe the attacks in the Globe Stories versus the Public Submissions. First, why do the Globe Stories tend to use a wider range of words, and use them more frequently, to describe the explosions? Why do the Public Submissions tend to use words such as “bombing” and “bombings” to describe the Marathon bombings instead of words such as “explosion(s),” “bomb(s),” and “blast(s)”?
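The “Percentage of Total Usage” figures quoted above are simple ratios of a sub-corpus count to the corpus-wide count. A minimal sketch of the arithmetic, using only counts stated in this post (the function name is hypothetical):

```python
def share_of_total(sub_count, total_count):
    """Percentage of a word's corpus-wide uses that fall in one sub-corpus."""
    if total_count == 0:
        return 0.0
    return round(100 * sub_count / total_count, 2)

# Counts quoted in the post: (uses in the sub-corpus, uses in all stories).
explosion_globe = share_of_total(147, 167)  # "explosion" in the Globe Stories
bombing_public = share_of_total(32, 39)     # "bombing" in the Public Submissions
bomb_public = share_of_total(32, 184)       # "bomb" in the Public Submissions
```

The same raw count of thirty-two yields very different shares because the denominators (39 versus 184) differ so much.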
Before moving on from these word clouds, I also want to take a moment to consider the use of the word “bombers” in the entire corpus. In 347 total stories, the word “bombers” was used only seven times. Moreover, each of these seven occurrences was from the Public Submissions. None of the Globe Stories even mentioned the “bombers.” I think it is very interesting that the “bombers” were mentioned so infrequently, particularly in the Globe Stories, where we have seen an abundance of references to the explosions or bombs themselves. Is this characteristic of reflecting on traumatic events? When writing these stories, do people tend to ignore the perpetrators? Furthermore, why would people writing these Public Submissions be (comparatively) more likely to mention the bombers in their stories than those who submitted their stories to the Globe?
I think all of the questions from the previous two paragraphs are very important to consider. I do not believe that they are entirely answerable at this time, given the limited corpus of text. However, I do believe they indicate a structural difference between how the Public Submissions and the Globe Stories were crafted. This might have to do with the contribution method, or with the time between the bombings and when the contributors wrote their stories.
I hope that this post has made a few important points. First, word clouds are not useless. They serve as excellent starting points for visualizing and beginning a text analysis. That being said, word clouds provoke questions but do not answer them. Their value lies in the steps you take after they spark those questions. Only after taking a deeper look at word frequencies was I able to solidify my findings and draw some preliminary conclusions. Second, word clouds are particularly useful (again, as a starting point) as a method of comparison between two related corpora. By generating and analyzing word clouds of corpora on the same topic, one can quickly make connections and identify differences between corpora that might be lost if one considered only word frequencies.