by Russ Ward (@russcward)
As the resident research and analytics guy here at Zemoga, I was asked by Briana Campbell, our Head of Social Media, to contribute to a conversation about the use of “sentiment” in social media monitoring. Apparently sarcasm, and the regional use of words like “wicked,” “bad,” or “beast” in a positive sense, had led some of her peers to dismiss “sentiment analysis” as flat-out voodoo.
Although it is less than perfect as a quantitative measure in real-world analysis (even a group of human raters will disagree with one another), I am an advocate of using “sentiment” data as an indicator of the feelings of a specific population. Today, complex algorithms process text data to determine sentiment, so it is far from voodoo or smoke and mirrors. No folks, “Sentiment Analysis” is not Voodoo!
To clarify the subject, and my position, I would like to take a few steps back and consider some of the scientific research done in this area in recent years. The subject clearly has significant potential as a serious and valuable indicator of the emotion of a given population, and many academic research projects have been under way to verify its efficacy. After all, we have seen entire countries have their political regimes overthrown through communications within social media.
The advent of computer-mediated communication (CMC) (Reference 1) enables large populations of people to take part in conversations regardless of the geography and social class of the participants. These conversations can be monitored and analyzed as large-scale ethnographic and psychosocial communities using either manual or automated analytical methods. Social media monitoring tools can track all manner of open-access, web-based communications: Facebook, MySpace, Twitter, SMS, instant messages and blogs, to name a few.
There is a great deal of power and advantage in understanding the emotional state of a population regarding just about anything from fashion to sports to politics.
In 2010 Schweitzer and Garcia (Ref 2) wrote a compelling paper about their mathematical model for the assessment of emotion in online communities. This model, while appearing complex, provides the framework to map out not only positive and negative emotion, but also the arousal factors of motivation within online community communication.
Their model is based on Brownian agent theory, which can be used to describe and classify the state of variables across a population summarized as a whole.
In the following images, a small part of the mathematical theory behind the Brownian agents is shown from the Schweitzer and Garcia paper. The causation model in Figure 1 describes how the sentiment in a social expression s is determined by the individual's valence v (the direction of their thinking, positive or negative) and their arousal a (their motivation to express a comment).
This sequence of equations reveals the factors that contribute to the determination of sentiment and motivational valence.
According to the authors, this model provides its users with the ability to predict the resulting collective emotions of the online communities for which it is being used.
In another 2010 paper, Thelwall et al. (Ref 3) describe a new algorithm named “SentiStrength” that identifies sentiment in informal English text. The authors report 60.6% accuracy for positive sentiment and 72.8% accuracy for negative sentiment within these short informal texts.
SentiStrength exploits contemporary cultural and social tendencies in slang, de facto grammar and spelling styles in cyberspace. It does so via lookup tables built either manually or through machine learning.
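To make the lookup-table idea concrete, here is a minimal sketch in Python. It is not SentiStrength itself: the word list, the weights, and the function names `normalize` and `score` are all invented for illustration, and the real algorithm is considerably more involved. It only shows how a table of slang-aware word strengths, plus a little spelling cleanup, can turn informal text into a positive and a negative score.

```python
import re

# Hypothetical lookup table: word -> strength (positive > 0, negative < 0).
# The entries and weights are invented for illustration only.
LEXICON = {
    "love": 3,
    "great": 2,
    "wicked": 2,   # regional slang used in a positive sense
    "beast": 2,    # slang used in a positive sense
    "hate": -3,
    "awful": -2,
    "boring": -2,
}


def normalize(token):
    """Lowercase and collapse letters repeated 3+ times ('looove' -> 'love')."""
    return re.sub(r"(.)\1{2,}", r"\1", token.lower())


def score(text):
    """Return (strongest positive, strongest negative) strengths found in text."""
    tokens = [normalize(t) for t in re.findall(r"[A-Za-z']+", text)]
    strengths = [LEXICON.get(t, 0) for t in tokens]
    positive = max((s for s in strengths if s > 0), default=0)
    negative = min((s for s in strengths if s < 0), default=0)
    return positive, negative


print(score("That concert was wicked, I looove it"))  # (3, 0)
print(score("Awful and boring"))                      # (0, -2)
```

A table like this is exactly where the “wicked” problem gets handled: someone, or some training process, has to decide that the word counts as positive for the population being monitored.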
This scientific research places sentiment analysis that uses validated processing algorithms such as these in the realm of meaningful web metrics, data that cannot be tossed aside as quackery or voodoo.
Continued research will gradually improve accuracy over time. In the meantime, however, I maintain that sentiment analysis is very useful as an indicator of the emotional status within social media communications.
If the negative sentiment for a subject is greater than 5 to 10%, then closer investigation is wise to find the root causes (which will be covered in the next installment – ed.).
Another way to look at it: suppose there is 10% negative sentiment in a sample of 100 people, meaning 10 people were classified as having a negative feeling about the subject. If that classification is 70% accurate using the SentiStrength software, then about 7 people really do have a negative emotional outlook about their experience. In this scenario, should we discard the feelings of 7% of the population as insignificant?
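For anyone who wants to plug in their own numbers, the back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The function name `likely_true_negatives` is made up for this post, and the calculation deliberately keeps the same simplification as the example: it treats accuracy as the share of flagged people who were flagged correctly and ignores negative comments the software missed.

```python
def likely_true_negatives(sample_size, negative_share, accuracy):
    """Rough estimate of how many flagged-negative people are truly negative.

    Simplifying assumption (same as the worked example in the text):
    'accuracy' is read as the fraction of negative flags that are correct.
    """
    flagged = sample_size * negative_share   # people classified as negative
    confirmed = flagged * accuracy           # flags we expect to be correct
    return flagged, confirmed


flagged, confirmed = likely_true_negatives(100, 0.10, 0.70)
print(f"{flagged:.0f} flagged, about {confirmed:.0f} likely correct")
```

With the numbers from the example, this gives 10 people flagged and about 7 likely to be genuinely negative.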
So here is my question:
If sentiment analysis is this scientific in its behind-the-scenes processing, why is this level of accuracy not useful to information users, and what level of accuracy will it take if human processing is about the same?
Reference 1: http://en.wikipedia.org/wiki/Computer-mediated_communication
2. Schweitzer & Garcia: An agent-based model of collective emotions in online communities. European Physical Journal B, 77: 533-545, 2010
3. M. Thelwall, K. Buckley, G. Paltoglou & D. Cai: Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology, 61(12): 2544-2558, 2010