The term text analytics sounds highly technical. Yet in articles and discussions about this field, the two forms of output that get the most mentions are little more than forms of counting. What are these outputs, and why do they make the cut as “analytical” in connection with text? Let’s start with the outputs, and then move on to the reasons.
Perhaps the one image that best represents text analytics shows words of various sizes stacked and piled into a shape. Most often this is a simple rectangle, although some can be strikingly free-form. This display is sometimes called a word cloud and sometimes a wordle.
For all its visual appeal, this display simply shows the relative frequencies of words in a document. The bigger words are more prevalent, the smaller ones less so. While this is often presented as a slick analytical feat, the only real trick is the computer’s skill at fitting the words together.
Additional computer processing takes place in the background, starting with the removal of so-called stop words. These are the smaller, frequent words that typically do not carry the meaning of a document, but rather serve to stitch it together. In the last sentence, stop words would include “these,” “are,” “the,” “that,” “but,” “to,” and so on. The computer eliminates these by consulting a dictionary, often a simple text file that the user can modify. Wordle programs typically also perform stemming, the process of removing tenses and regularizing endings so that different forms of the same word do not get counted as distinct entities.
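To make the mechanics concrete, here is a minimal sketch in Python of what such a tool does behind the scenes: drop stop words from a user-editable list, apply a toy suffix-stripping “stemmer,” and count what remains. The stop-word list and suffix rules below are illustrative stand-ins, not any particular product’s dictionaries; real tools use full stemming algorithms such as Porter’s.

```python
from collections import Counter
import re

# Illustrative stop-word list; real tools ship a much longer, user-editable text file.
STOP_WORDS = {"these", "are", "the", "that", "but", "to", "a", "of", "and", "it", "so"}

# Toy suffix stripping; production stemmers use algorithms such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def word_frequencies(text):
    words = re.findall(r"[a-z']+", text.lower())        # crude tokenization
    kept = [w for w in words if w not in STOP_WORDS]    # drop stop words
    return Counter(stem(w) for w in kept)               # stem, then count

sample = "The analysts counted words, and the counting of words continued."
print(word_frequencies(sample).most_common())
# A wordle simply scales and packs each remaining word according to its count.
```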
Some programs go further and perform lemmatization, which tries to recognize the part of speech of a word, so that (for instance) “busy” the verb does not get lumped in with “business” the noun or “busy” the adjective. However, stop word removal, stemming and even lemmatization are computational problems, not analytical ones. From your perspective, you just ask for a wordle and one quickly appears. The computer may work hard, but you need not do any analytical thinking.
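For readers who want to see the difference, the sketch below uses the NLTK library (assuming it is installed and its WordNet data has been downloaded); it is meant only to illustrate why part-of-speech information matters, not to suggest that any particular wordle tool works this way.

```python
# Assumes: pip install nltk, then nltk.download("wordnet") for the lemmatizer's data.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# A stemmer chops endings mechanically, so similar-looking words can collapse together.
print(stemmer.stem("busy"), stemmer.stem("business"))

# A lemmatizer consults a dictionary plus a part-of-speech tag ("v" = verb, "n" = noun),
# so different words, and different uses of the same word, can be kept apart.
print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("businesses", pos="n"))
```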
Another prevalent form of output that is not quite analytical is called sentiment analysis. As with the removal of stop words, this involves the use of a dictionary. And as with the creation of the wordle, it is centered on counting. The analysis typically refers to dictionaries of positive and negative words and phrases. Looking at a document or set of documents, the computer counts the occurrences of each type. The balance of plus and minus gives the “sentiment score.” Dictionaries of sentiment words can be found online and downloaded for use; you can easily find one with about 2,000 positive phrases and about 6,000 negative ones.
Once again, the computational challenges can be ferocious. For instance, you would make a mistake about sentiment if you just counted positive and negative words in this sentence: “My new SoggyOs breakfast food is so good, it makes Sorghum Sweeties taste bad by comparison, and Kardboard Krunchies just terrible.” There are two negatives and one positive, so a simple count would peg this sentence as negative.
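A back-of-the-envelope version of that dictionary count, using tiny illustrative word lists rather than any published lexicon, shows the miscount directly:

```python
import re

# Tiny illustrative lexicons; real sentiment dictionaries run to thousands of entries.
POSITIVE = {"good", "great", "tasty", "sweet"}
NEGATIVE = {"bad", "terrible", "awful", "poor"}

def naive_sentiment(text):
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg   # the "sentiment score" is simply the balance of the two counts

sentence = ("My new SoggyOs breakfast food is so good, it makes Sorghum Sweeties "
            "taste bad by comparison, and Kardboard Krunchies just terrible.")
print(naive_sentiment(sentence))   # one positive vs. two negatives: the score is -1
```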
Yet clearly, the writer is singing the praises of his or her new breakfast-like substance. Indeed, unusual locutions, neologisms, slang, sarcasm and awful sentence structure often prove highly confusing to a computer. But resolving these requires better algorithms and faster machines, not more thought from the analyst.
We can expect counting and enumeration to continue getting more accurate. But still, they will remain counting. What are the reasons for this? Stay tuned for the next installment, where we discuss probable causes.
About the Author: Dr. Steven Struhl has been involved in marketing science, statistics and psychology for 30 years. Before founding Converge Analytic, he was Sr. Vice President at Total Research/Harris Interactive for 15 years, and earlier served as director of market analytics and new product development at statistical software maker SPSS.