In my last posting on text mining, I described how to collect data from Twitter. In this post, I will describe how we can summarize a large set of tweets on a certain topic - for example the latest SITE conference.
Background: Giving structure to your data
Text data, such as tweets, comments or posts usually comes with limited structure, as compared to scores on likert scales. To visualize and quantify the data we have to give it structure in the first place. Suppose we have a character vector as the following:
 "I am a member of the XYZ association"
 "Please apply for our open position"
 "The XYZ memorial lecture takes place on wednesday"
 "Vote for the most popular lecturer!"
What is a character vector? You can think of a character vector as a container of all text pieces. Each piece represents the text from an individual, and is assigned a number. You can access any piece by using its given number. This type of data is easy for humans to read, but not for machines. Machine prefers the same information structured in the following way:
A text file structured in this way is called document-term matrix. Each row in the matrix represents a word, while each column represents a document, which refers to all the texts from an individual. Each element in the matrix represents the number of times a particular word appears in a particular document. You may have noticed that all texts have been converted to lowercase in this matrix, while some words, like “a” or “the” are not shown up in the matrix.
To convert the tweet texts you collect into a document-term matrix, the following steps are usually necessary:
- Remove nonsense characters
- Convert all words to lowercase.
- Remove stop words, such as “a”, “an”, “that” and “the”.
Sample Data - Tweets on #siteconf
Did you miss your favorite AACE conference? Would you like to find out what predominant topics people discussed? We collected 709 tweets using the hashtags "#siteconf".
Step 1: Word Clouds
To take a quick look at our data, an initial visual representation with world clouds is helpful.
Step 2: Cluster Tree
A more structured way to explore the data in an associational sense is to look at the collection of terms that frequently co-occur. This method is called cluster analysis.
Cluster analysis is a way of finding association between items and bind nearby items into groups. A typical visualization technique is a tree diagram called dendrogram. The most common cluster analysis include K-means clustering and hierarchical clustering. K-means clustering require you to specify how many groups you prefer to have in the result before the analysis, while hierarchical clustering doesn’t have this requirement.
The density and shape of the dendrogram may vary depending on the sparsity. The above one is the dendrogram on sparsity .95. It is interesting that when people tweeted using the hashtag “#msueped”, they also tended to use “#site2015”. “#msueped” stands for Educational Psychology and Educational Technology from Michigan State University. You can tell that many people from this program went to SITE 2015 conference.
Did you gain a sense what the SITE community is talking about? Data visualization is certainly helpful to make sense of large datasets as it allows you to gain an overview from an elevated perspective. However, don’t mistake a set of images for the real thing. If you attended SITE 2015 in Las Vegas, your first hand experience is likely to be totally different and certainly more in-depth. Also keep in mind that while social media is becoming ever more popular, Twitter users are still only a sub-group of the whole audience.
No approach is neutral in its analysis: Understanding the tools that we use helps us to interpret seemingly obvious connections more carefully. If you want to explore how we produced these visualizations use our sample data set with instructions.