In my previous post I introduced CoreNLP as a viable solution for sentiment analysis. Today, I will show you how to use the library in practice. We will start by understanding the data model and then break down the algorithm for calculating the tweet sentiment.

Objectives

Our objective is to assess the emotional load, or sentiment, of stock market tweets. Investors comment on market trends and share their expectations and frustrations. At scale, when tapping into a stream of tweets, it would be interesting to see how these reactions correlate with the actual price changes of the stocks.

Today, we only focus on capturing the sentiment.

Modelling a Tweet

A tweet is not just a piece of plain text. It carries additional information in the form of hashtags, emojis, mentions and retweets. There is a lot to consider when extracting information from a tweet in a real-life scenario.

For the purpose of this tutorial I will focus on the main attributes, namely the tweeted message and its hashtags.
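To make the model concrete, here is a minimal sketch of a tweet reduced to those two attributes. The class name `Tweet`, the `parse` factory and the hashtag pattern are my own illustrative assumptions, not part of CoreNLP, and the pattern is deliberately simple (a `#` followed by letters, digits, underscores or hyphens).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative tweet model: the raw message plus the hashtags found in it. */
final class Tweet {
    // Assumption: a hashtag is '#' followed by letters, digits, underscores or hyphens.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w[\\w-]*)");

    private final String message;
    private final List<String> hashtags;

    private Tweet(String message, List<String> hashtags) {
        this.message = message;
        this.hashtags = Collections.unmodifiableList(hashtags);
    }

    /** Builds a Tweet from raw text, collecting hashtags without the leading '#'. */
    static Tweet parse(String message) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(message);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return new Tweet(message, tags);
    }

    String getMessage() {
        return message;
    }

    List<String> getHashtags() {
        return hashtags;
    }
}
```

Calling `Tweet.parse(...)` on a raw message keeps the text untouched and exposes the hashtags as a read-only list, in the order they appear.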

Analysing a tweet sentiment with CoreNLP

Looking at the tweet message, I am sure you must wonder what on earth is positive about news of a coronavirus outbreak. Well, sentiment analysis is not a trivial task after all. For example, hashtags, being essentially standalone words, have more impact. Taking out the obvious suspects (#coronavirus, #COVID-19 and #decline) would turn it into a perfectly neutral statement. However, adding any of these terms makes the sentiment slide onto the negative side of the scale. All in all, the results won’t be perfectly accurate, but they suffice for the purpose of this tutorial.

In a nutshell, a tweet is broken down into sentences. How do you correctly identify a sentence in a tweet in the first place? That’s a very good question. It’s configurable too and I will get to it in one of the follow-up posts. Each sentence is then analysed for sentiment and CoreNLP provides a detailed score on a scale of emotions.
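CoreNLP's sentiment annotator assigns each sentence one of five classes, from very negative to very positive. The sketch below shows one way to represent that scale and pick the most likely class from a per-class probability distribution. The enum `SentimentLevel` and the `mostLikely` helper are hypothetical names of mine, not CoreNLP types, and the array simply stands in for whatever distribution the library reports.

```java
/** The five sentiment classes, ordered from most negative (0) to most positive (4). */
enum SentimentLevel {
    VERY_NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE, VERY_POSITIVE;

    /**
     * Picks the most likely class from a five-element probability distribution,
     * indexed in the same order as the enum constants. Ties go to the lower index.
     */
    static SentimentLevel mostLikely(double[] probabilities) {
        SentimentLevel[] levels = values();
        if (probabilities.length != levels.length) {
            throw new IllegalArgumentException(
                    "Expected " + levels.length + " probabilities, got " + probabilities.length);
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return levels[best];
    }
}
```

Keeping the constants in scale order means `ordinal()` can double as a numeric score when sentiments need to be averaged later on.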

The overall tweet score is a thin layer of business logic you, as a developer, have to provide. That’s what the next section is about.

Weighted Sentiments

Typically, the longest sentence determines the sentiment of the entire tweet. This approach is generic, however: it applies to any piece of text. In a tweet, I feel that hashtags carry quite a lot of weight in themselves. We use them to highlight keywords and set the tone of the message (as well as making it searchable, of course ;-).

Therefore, I determine the weight of a sentence as follows.

sentence weight = sentence length + the total length of the hashtags found in the sentence

Essentially, each keyword doubles in weight.
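The formula above can be sketched in a few lines. I am assuming here that "length" means the number of characters and that a hashtag's length excludes the leading `#`; the class name `SentenceWeight` and the simple hashtag pattern are illustrative choices as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative weight calculation for a single sentence of a tweet. */
final class SentenceWeight {
    // Assumption: a hashtag is '#' followed by letters, digits, underscores or hyphens.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w[\\w-]*)");

    private SentenceWeight() {
    }

    /** sentence weight = sentence length + total length of the hashtags in the sentence. */
    static int of(String sentence) {
        int hashtagLength = 0;
        Matcher matcher = HASHTAG.matcher(sentence);
        while (matcher.find()) {
            // Assumption: count the hashtag text only, without the '#'.
            hashtagLength += matcher.group(1).length();
        }
        return sentence.length() + hashtagLength;
    }
}
```

For instance, `SentenceWeight.of("Stocks in #decline")` yields 25: 18 characters for the sentence plus 7 for the hashtag text `decline`.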

Hashtags Carry Emotions Too

More often than not, a tweet ends with a number of hashtags. I think it’s fair to treat this sequence as a standalone sentence. This sentence’s weight will be doubled, as each of its words is a keyword. So yes, if an emotionally loaded hashtag is used multiple times throughout the tweet, it will dramatically affect the sentiment of the whole tweet. I think it’s a fair assumption for starters. I might adjust the algorithm as I go, but that’s what I am going to use for now.
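One way to separate that trailing run of hashtags from the rest of the tweet is to walk backwards over the whitespace-separated tokens. This is only a sketch under my own assumptions: the class name `TrailingHashtags` is hypothetical, any token starting with `#` counts as a hashtag, and whitespace in the body is normalised to single spaces along the way.

```java
import java.util.Arrays;

/** Illustrative helper that splits a tweet into its body and its trailing hashtags. */
final class TrailingHashtags {
    private TrailingHashtags() {
    }

    /**
     * Returns a two-element array: [body, trailing hashtags].
     * The second element is empty when the tweet does not end with hashtags.
     */
    static String[] split(String tweet) {
        String[] tokens = tweet.trim().split("\\s+");
        // Walk back from the end while tokens look like hashtags.
        int firstTrailing = tokens.length;
        while (firstTrailing > 0 && tokens[firstTrailing - 1].startsWith("#")) {
            firstTrailing--;
        }
        String body = String.join(" ", Arrays.copyOfRange(tokens, 0, firstTrailing));
        String tags = String.join(" ", Arrays.copyOfRange(tokens, firstTrailing, tokens.length));
        return new String[] {body, tags};
    }
}
```

Hashtags inside the body stay where they are; only the uninterrupted run at the very end is split off, so it can be scored as a sentence of its own.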

Overall Tweet Sentiment

Once we know the weighted sentiment of each of the sentences (including hashtags), we are ready to determine the sentiment of the entire tweet. This is done by simply extracting the sentence with the maximum weight. Its sentiment equals the sentiment of the tweet. Of course, there could be several sentences with the same (max) weight. In this case we calculate the resulting sentiment as a weighted average of their sentiments.
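The selection step can be sketched as follows. The names `TweetSentiment` and `WeightedSentence` are hypothetical, and I assume each sentence already comes with a numeric sentiment score where a higher value means a more positive sentence. Since tied sentences share the same weight, their weighted average reduces to a plain average.

```java
import java.util.List;

/** Illustrative selection of the overall tweet sentiment from weighted sentences. */
final class TweetSentiment {
    /** A sentence's weight paired with its numeric sentiment score. */
    static final class WeightedSentence {
        final int weight;
        final double sentiment;

        WeightedSentence(int weight, double sentiment) {
            this.weight = weight;
            this.sentiment = sentiment;
        }
    }

    private TweetSentiment() {
    }

    /** Returns the sentiment of the heaviest sentence, averaging over ties. */
    static double overall(List<WeightedSentence> sentences) {
        if (sentences.isEmpty()) {
            throw new IllegalArgumentException("A tweet needs at least one sentence");
        }
        int maxWeight = Integer.MIN_VALUE;
        for (WeightedSentence sentence : sentences) {
            maxWeight = Math.max(maxWeight, sentence.weight);
        }
        // Equal weights cancel out, so the weighted average is a plain mean.
        double sum = 0.0;
        int count = 0;
        for (WeightedSentence sentence : sentences) {
            if (sentence.weight == maxWeight) {
                sum += sentence.sentiment;
                count++;
            }
        }
        return sum / count;
    }
}
```

With sentences weighted 40, 25 and 40 and scored 1.0, 3.0 and 2.0, the two heaviest sentences tie and the tweet ends up with a sentiment of 1.5.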

The Complete Example

If you are like me, you must have been looking for the actual implementation. Without further ado, here it is. Hope it makes sense. If not, just ask a question in the comment section below and I will do my best to give you a comprehensive answer in a timely manner.

Summary

In this post, we’ve established how to represent a tweet and reviewed the algorithm for determining the overall tweet sentiment. What I have not covered is the actual algorithm the library uses to determine the sentiment of a piece of text. I encourage you to go through the sentiment section of the Stanford NLP website.

Thanks for reading thus far and stay tuned for the next part of this tutorial where we look at data collection and cleansing.


Tomas Zezula
