Text Mining – Sentiment Analysis on Autonomous Driving


Hello All! 

In this blog I will walk you through doing sentiment analysis on Twitter data using R. Sentiment analysis is the study of people's views on a topic, as expressed through their participation in social media, and of how those views relate to some object or phenomenon. Some examples: studying Twitter users' comments about a presidential candidate and how those users' opinions might affect the stock market should that candidate win; or mining customers' reviews of a specific product in order to gauge its future demand and sales. The applications of sentiment analysis are endless, and it is a great way to understand a topic in this data-driven world.

Before I begin, I want to thank my friend Huy Q. Dang, who attended the same graduate program as me, for inspiring this blog. Huy had earlier discussed with me the possibility of starting a research project that uses sentiment analysis to predict stock market movement, and his suggestions motivated me to learn the technique. After reading some tutorials and watching instructional videos online, I am now comfortable running the analysis in R, and I want to show you that you can do it too. 🙂

Required tool: R or RStudio. Optional tool: Notepad or any txt-format reader.

First of all, run the following code to install and load the "curl" package:

> install.packages("curl")
> require(curl)

*Check out this website for what the "curl" package does: Curl Package

Next, install the following packages (see the one-liner after this list) and then load them using the "library" function:

> library(twitteR)
> library(ROAuth)
> library(stringr)
> library(tm)
> library(wordcloud)
> library(RColorBrewer)
> library(plyr)
> library(ggplot2)
> library(scales)
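If some of these are not installed yet, you can grab them all in one shot before loading them (same package names as in the list above):

> install.packages(c("twitteR", "ROAuth", "stringr", "tm", "wordcloud", "RColorBrewer", "plyr", "ggplot2", "scales"))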

If your R version is 3.0 or later, you won't be able to install the "sentiment" package via the "install.packages" command. Instead, you'll have to install sentiment scoring as a plug-in to the "tm" package. See the code below:

> install.packages("tm.lexicon.GeneralInquirer", repos="http://datacube.wu.ac.at", type="source")
> library(tm.lexicon.GeneralInquirer)
> install.packages("tm.plugin.sentiment", repos="http://R-Forge.R-project.org")
> library(tm.plugin.sentiment)
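To sanity-check that the lexicon loaded correctly, you can peek at the General Inquirer word lists the scoring relies on. This is just a quick optional check; "terms_in_General_Inquirer_categories" is the lookup function exported by "tm.lexicon.GeneralInquirer":

> pos.words <- terms_in_General_Inquirer_categories("Positiv")
> neg.words <- terms_in_General_Inquirer_categories("Negativ")
> length(pos.words); length(neg.words)   # sizes of the positive/negative word lists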

Now we have all the packages required to run sentiment analysis. But before we can start, we have to authorize R to access our Twitter account. To do that, we have to register an application (to obtain API credentials) at https://apps.twitter.com/.

Once done registering, note the values of your API key (also known as the "consumer key"), API secret, access token and access token secret.

Save the Twitter back-end information into new objects for easy reference later:

> api_key <- "your API key from Twitter"
> api_secret <- "your API secret from Twitter"
> access_token <- "your access token from Twitter"
> access_token_secret <- "your access token secret from Twitter"
> setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

Let’s begin coding for sentiment analysis now:                                              

> autonomous_tweets = searchTwitter("driverless", n=2000, lang="en")
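A side note: "searchTwitter" takes a few more useful arguments. A hedged example (the dates here are placeholders; argument names as per the twitteR documentation):

> autonomous_tweets = searchTwitter("driverless", n=2000, lang="en", since="2016-01-01", until="2016-06-30", retryOnRateLimit=120)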

Then fetch text from tweets that contain the word “driverless”:                        

> autonomous_txt = sapply(autonomous_tweets, function(x) x$getText()) 
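Alternatively, if you prefer working with a data frame over a character vector, twitteR ships a converter; the tweet text then sits in the "text" column:

> autonomous_df = twListToDF(autonomous_tweets)
> autonomous_txt = autonomous_df$text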

If you run "autonomous_txt" now, you will see all 2,000 mined results. For simplicity, I have only included the last 3 outputs generated by the code:

[Screenshot: the last 3 raw tweet texts]

We should then remove retweet headers (the "RT @user" prefixes) from the stored tweets:

> autonomous_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", autonomous_txt)

Remove @usernames (mentions) from the extracted tweets:

> autonomous_txt = gsub("@\\w+", "", autonomous_txt)

Remove punctuation:

> autonomous_txt = gsub("[[:punct:]]", "", autonomous_txt)

Because we are only interested in mining the texts, we should remove numbers as well:      

> autonomous_txt = gsub("[[:digit:]]", "", autonomous_txt)

Also remove website links from the tweets extracted:                                      

> autonomous_txt = gsub("http\\w+", "", autonomous_txt)

Trim texts by removing unnecessary spaces:                                                                

> autonomous_txt = gsub("[ \t]{2,}", "", autonomous_txt) # removes runs of spaces/tabs
> autonomous_txt = gsub("^\\s+|\\s+$", "", autonomous_txt) # trims leading/trailing whitespace

Remove NA entries and clear the element names:

> autonomous_txt = autonomous_txt[!is.na(autonomous_txt)]
> names(autonomous_txt) = NULL
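If you plan to repeat this for other search terms, the cleaning steps above bundle naturally into one small helper (just a convenience wrapper around the same gsub calls):

clean_tweets <- function(txt) {
  txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt)  # retweet headers
  txt <- gsub("@\\w+", "", txt)                        # @usernames
  txt <- gsub("[[:punct:]]", "", txt)                  # punctuation
  txt <- gsub("[[:digit:]]", "", txt)                  # numbers
  txt <- gsub("http\\w+", "", txt)                     # links
  txt <- gsub("[ \t]{2,}", "", txt)                    # runs of spaces/tabs
  txt <- gsub("^\\s+|\\s+$", "", txt)                  # leading/trailing whitespace
  txt[!is.na(txt)]                                     # drop NA entries
}

autonomous_txt <- clean_tweets(sapply(autonomous_tweets, function(x) x$getText()))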

Now that we have the prerequisite code nailed down, we can analyze our tweets with a polarity tool, which tries to classify statements as "positive", "negative" or "neutral". There are many ways to achieve this. If you happen to have an R version older than 3.2.2, you can use the "classify_polarity" function from the "sentiment" package. The code would be:

> autonomous_pol <- classify_polarity(autonomous_txt, algorithm="bayes")

However, if you have R version 3.2.2 or newer, as I and many other R users do, the "classify_polarity" function won't be available and you'll have to use the "score" function to achieve the same goal.

First, aggregate the tweets into a corpus, the format the scorer expects, using the "Corpus" function:

> text.corpus <- Corpus(VectorSource(autonomous_txt))
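You can peek at the first couple of documents to confirm the conversion worked:

> inspect(text.corpus[1:2])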

Next, alias "tm_term_score" to "tm_tag_score". Newer versions of "tm" renamed this function, but the sentiment plug-in still calls it by its old name, so the alias lets the "score" function find it:

> tm_tag_score <- tm_term_score

If the R-Forge install above did not work for you, you can also grab the latest version of the plug-in from GitHub via devtools:

> install.packages("devtools")
> library(devtools)
> install_github("mannau/tm.plugin.sentiment")
> library(tm.plugin.sentiment)

Good to go now! The "score" function ships with "tm.plugin.sentiment" (there is no separate "score" package to install), so you can call it directly:

> text.corpus <- score(text.corpus)

Now you can run the “meta” function to see the sentiment of each tweet extracted:              

> meta(text.corpus) 

Due to the large number of outputs (2,000 rows), I will only list the header row and the last 6 observations here. The important column to look at is "polarity": 0 is neutral, 0.3333 slightly positive, -0.3333 slightly negative, 1 positive and -1 negative. Polarity is determined by comparing "pos_refs_per_ref" (positive sentiment references) with "neg_refs_per_ref" (negative sentiment references); a tweet with equal values in these two fields gets a polarity of 0 (neutral). And if you want to look a little deeper: the higher the "pos_refs_per_ref" or "neg_refs_per_ref" value, the stronger the sentiment, be it positive or negative.
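For the curious: as far as I can tell from the tm.plugin.sentiment source, polarity is the normalized difference (p - n) / (p + n), where p and n are the positive and negative reference counts. That would explain values like 0.3333 (two positive hits against one negative). A tiny sketch:

> polarity <- function(p, n) (p - n) / (p + n)
> polarity(2, 1)   # 0.3333333 – slightly positive
> polarity(1, 1)   # 0 – neutral
> polarity(0, 3)   # -1 – negative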

Output:    

[Screenshot: meta() output – header row and the last 6 observations]

It would be helpful to look at the frequency distribution of polarity as a histogram. Because the corpus metadata print in character form, they are not directly usable for numeric operations such as a histogram. You can, however, save the metadata in txt format and do some light editing to keep just the polarity column. The code to save the data to a txt file is as follows:

> ncorpus <- meta(text.corpus)
> writeLines(as.character(ncorpus), con="corpus.txt")

Open the "corpus.txt" file you just saved, remove all "c(" and ")" characters as well as the trailing spaces at the end of each line, and save the result as "newcorpus.txt". You can now load that txt file back into R for further analysis:

> sentiment_result <- read.table("C:/Users/Shaolun/Desktop/Wordpress blogs/Sentiment analysis/newcorpus.txt", sep=",")

Run the following code to stack the values vertically for easy analysis:  

> sentiment_result <- stack(sentiment_result) 
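Incidentally, if "meta" returns a plain data frame in your tm version, you may be able to skip the txt-file round trip entirely and build "sentiment_result" directly; worth trying before editing files by hand:

> sentiment_result <- data.frame(values = as.numeric(meta(text.corpus)$polarity))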

You can also run the following code to retrieve the unique set of "polarity" values:

> uni <- unique(sentiment_result[, “values”])
> uni

[1] 0.0000000 1.0000000 0.3333333 -1.0000000 -0.3333333 -0.5000000 0.5000000 0.6000000
[9] -0.6666667 -0.2000000
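A quick frequency table turns these values into shares, which is where the percentages quoted below come from:

> round(100 * prop.table(table(sentiment_result$values)), 1)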

Run the following code to generate a nice histogram of polarity:

> ggplot(sentiment_result, aes(values)) + geom_histogram(col="orange", fill="green", alpha=0.2) + xlab("Polarity") + ylab("Count") + ggtitle("Histogram for Polarity") + scale_y_continuous(breaks=pretty_breaks(n=10), limits=c(0,1000))

[Figure: histogram of polarity]

As you can see from the resulting histogram, most of the tweets are either strongly positive (1.0, about 46%) or neutral (0.0, about 41%) about autonomous driving, with only a small number of users holding strongly negative views.

Another good practice is to stem the words, that is, reduce them to their root form so that variants of the same word (for example "drive", "drives" and "driving") are displayed consistently in the word cloud. The "stemDocument" transformation does just that. Another good idea is to convert all text to lower case, both for consistency and so that you can easily remove unwanted words from the word cloud later. Below is the code to achieve those two objectives:

> newcorpus <- tm_map(text.corpus, stemDocument)
> nncorpus <- tm_map(newcorpus, content_transformer(tolower))

We don't want words such as "the", "driverless", "car", "driverlesscar", "vehicle", "this" and "what" to appear in the word cloud, because they are redundant and carry no meaning in this case. Don't worry too much in the beginning, though: you can always remove additional words once you've plotted the cloud and spotted more unwanted ones. So let's remove these words by running the following code:

> nncorpus <- tm_map(nncorpus, removeWords,
  c("selfdriving", "automaker", "drive", "driverless",
    "car", "driverlesscar", "to", "the", "this", "that",
    "just", "vehicle", "our", "me", "you", "have", "not",
    "are", "get", "from", "and", "for", "i", "may", "with",
    "which", "what", "how"))

Here comes the final part! Run the following code to get a beautiful word cloud containing no more than 70 words:

> wordcloud(nncorpus, scale=c(3,.5), max.words=70, random.order=FALSE, colors=brewer.pal(12,"Paired"))
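One small tip: "wordcloud" places words with some randomness, so if you want the exact same layout every time, set a seed right before the call:

> set.seed(1234)   # then re-run the wordcloud() call above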

[Figure: word cloud of the most frequent terms]

There you go. As you might have expected, most of the words generated are either car- or technology-related, with "michigan", "future", "fresh" and "startup" among the most common. A few interesting adverbs and adjectives also make a surprise appearance ("beyond", "inspirational", "greatest") and show stronger emotion than the other words in the cloud. Overall, the views tend to be either positive or neutral towards driverless technology, confirming what we saw earlier in the histogram.

That's it for sentiment analysis! I hope you now have a better understanding of this text mining method in R. Please contact me if you have any questions. 🙂
