If you have been following my blog, you may recall that in 2016 I wrote an article on Sentiment Analysis with R using Twitter data (link). If you read it, I hope you enjoyed it. These days, it is hard to argue against the fact that Python is quickly gaining steam as one of the top programming languages for data professionals, at the expense of R. Given this trend, I want to share with you a simple Python script that serves the same purpose: sentiment analysis. Specifically, the script takes positive, negative and neutral vocabularies as training data, trains a Naive Bayes classifier on them, and uses that classifier to predict the sentiment of sentences.
There are 2 reasons why I took this simpler approach instead of using Twitter data: 1) Some of my readers might have very little experience running sentiment analysis in Python, and this approach should help them make the transition in a simple way; 2) Sentiment analysis on movie review data and other commercial data is quickly gaining popularity. Starting with simple vocabularies as training data and testing on complete sentences is a good foundation for acquiring those skills later.
The Python script can be accessed through the link I provided below:
Here is what each block of the script does, in a nutshell:
1. First, import the required packages, all of them nltk-related, so that Python can run the NLP algorithms when needed.
2. Pass in training vocabularies for the positive, neutral and negative categories so that the algorithm can be trained.
3. Create the ‘word_feats’ function, which maps each word it receives (the training vocabularies here, and the words of the test sentence later) to the value ‘True’ and wraps those key-value pairs in a dictionary, because that is the format the classification algorithm expects.
4. Create word features for each of the 3 categories: pair each dictionary created in Step 3 with the label ‘pos’, ‘neg’ or ‘neu’ that corresponds to its sentiment, and collect the pairs in a list.
5. Combine the 3 feature sets created in Step 4 into one list called ‘train_set’, and train the Naive Bayes classifier on it.
6. Write the sentence “Awesome movie. I like it. It is so bad.” to test how effective the algorithm is, given the training vocabularies from Step 2.
7. Split the text on periods into separate sentences. Remove the stop words so they don’t confuse the algorithm. Split the sentences again on white space so that they become individual words. Note that we removed the stop words for 2 reasons: 1) The training data are all single words and contain no sentences. Sentences might include word combinations that reverse the meaning (e.g. ‘not great’, ‘not bad’), in which case keeping stop words like ‘not’ in the testing data would be wise. 2) The testing data, in this case the text ‘Awesome movie. I like it. It is so bad.’, contains no such reversals, so keeping stop words would only slow down the algorithm without adding any value.
8. Run the Naive Bayes classifier on the sentence from Step 6.
9. Print the output from the classification model:
awesome movie –> pos
awesome movie: 1 vs -0
i like it –> pos
i like it: 1 vs -0
it is so bad –> neg
it is so bad: 0 vs -1
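To make the steps above concrete, here is a minimal sketch of how such a script could look. It is not the linked script itself: the vocabularies, the tiny hand-picked stop-word set, the helper names `build_classifier` and `classify_sentence`, and the rule of comparing positive votes against negative votes are all my assumptions for illustration. Only `NaiveBayesClassifier.train` and `classify` come from nltk.

```python
from nltk.classify import NaiveBayesClassifier

# Hypothetical training vocabularies (assumed, not the article's actual lists).
positive_vocab = ["awesome", "good", "great", "like", "nice"]
negative_vocab = ["bad", "terrible", "useless", "hate", "awful"]
neutral_vocab = ["movie", "the", "sound", "actors", "was"]

# Small hand-picked stop-word set so the sketch runs without extra downloads.
# nltk.corpus.stopwords.words("english") is the usual source, after a
# one-time nltk.download("stopwords").
STOP_WORDS = {"i", "it", "is", "so", "a", "an"}


def word_feats(words):
    """Map each word in a list to True, the dict format nltk classifiers accept."""
    return {word: True for word in words}


def build_classifier():
    """Build labeled feature sets from the vocabularies and train on them."""
    # Each word is wrapped in a list: iterating over a bare string would
    # produce one feature per character instead of one per word.
    train_set = (
        [(word_feats([w]), "pos") for w in positive_vocab]
        + [(word_feats([w]), "neg") for w in negative_vocab]
        + [(word_feats([w]), "neu") for w in neutral_vocab]
    )
    return NaiveBayesClassifier.train(train_set)


def classify_sentence(classifier, sentence):
    """Classify each non-stop word, then compare positive and negative votes."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    pos = neg = 0
    for word in words:
        label = classifier.classify(word_feats([word]))
        if label == "pos":
            pos += 1
        elif label == "neg":
            neg += 1
    if pos > neg:
        overall = "pos"
    elif neg > pos:
        overall = "neg"
    else:
        overall = "neu"  # assumed tie-breaking rule
    return overall, pos, neg


classifier = build_classifier()
text = "Awesome movie. I like it. It is so bad."
# Split on periods and drop the empty piece after the final period.
sentences = [s.strip() for s in text.split(".") if s.strip()]
for sentence in sentences:
    overall, pos, neg = classify_sentence(classifier, sentence)
    print(f"{sentence.lower()} --> {overall}")
    print(f"{sentence.lower()}: {pos} vs -{neg}")
```

Keeping the three vocabularies the same size gives every label the same prior, so each known word is decided purely by which vocabulary it appeared in. Words the classifier never saw carry no evidence at all, which is one more reason to filter them out before voting.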
Examine the output. As you can see, all three sub-sentences were classified correctly: ‘Awesome movie’ as positive, ‘I like it’ as positive, and ‘It is so bad’ as negative. What really helped is the removal of stop words. Had we not done that in the script, the classifier might have labeled ‘it is so bad’ incorrectly, because ‘it’, ‘is’ and ‘so’ are all stop words, and these words would be classified as neutral unless otherwise specified. That would put 3 neutral words against 1 negative word (‘bad’), which could very well change the output to neutral or positive. So, when your training data are single words and your test sentences contain no negations, remember to remove stop words, and your model’s results should be more accurate.
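The 3-against-1 arithmetic can be checked with a few lines of plain Python. This is only a sketch under assumptions: the per-word labels are a hypothetical lookup table standing in for the trained classifier, any unlisted word is treated as neutral, and the final label is a simple majority vote over all three labels, which may differ from how the linked script combines its counts.

```python
from collections import Counter

# Hypothetical per-word labels standing in for the classifier's predictions.
# Any word not listed here is treated as neutral (an assumption).
WORD_LABELS = {"awesome": "pos", "like": "pos", "bad": "neg"}
STOP_WORDS = {"i", "it", "is", "so"}


def vote(sentence, remove_stop_words):
    """Return the majority label and the per-label vote counts."""
    words = sentence.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    counts = Counter(WORD_LABELS.get(w, "neu") for w in words)
    if not counts:
        return "neu", {}  # nothing left to vote on
    label = counts.most_common(1)[0][0]
    return label, dict(counts)


# Stop words kept: three neutral votes outnumber the single negative vote.
print(vote("it is so bad", remove_stop_words=False))  # ('neu', {'neu': 3, 'neg': 1})
# Stop words removed: only 'bad' votes, so the sentence comes out negative.
print(vote("it is so bad", remove_stop_words=True))   # ('neg', {'neg': 1})
```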