The intense campaign battle between presidential candidates Donald Trump and Hillary Clinton has dominated the news media over the past few months. In particular, Trump’s provocative comments on various social issues have made him one of the most controversial candidates to run for U.S. president. There is even news coverage comparing both candidates’ views on the tech industry: http://www.nbcnews.com/tech/tech-news/where-hillary-clinton-donald-trump-stand-tech-issues-n619736. In this post I will connect R to Twitter and run a sentiment analysis to see what Twitter users think about Trump.
If you’ve been following my site recently, you might’ve noticed my sentiment analysis post on autonomous driving. If you haven’t, I encourage you to at least glance over that post, as it lists the packages required to run sentiment analysis with R. The data in that post was based on 2,000 tweets containing thoughts on driverless technology. The current post about Trump is a small step forward in that it captures 10,000 tweets containing the word “Trump”; the larger sample should make the results here arguably more reliable.
After you’ve installed the essential packages for the Twitter connection and for creating plots, you can run code similar to that in my “autonomous driving” post. Below is some of the code needed to run a sentiment analysis:
> install.packages("tm.lexicon.GeneralInquirer", repos = "http://datacube.wu.ac.at", type = "source")
> library(tm.lexicon.GeneralInquirer)
> install.packages("tm.plugin.sentiment", repos = "http://R-Forge.R-project.org")
> library(tm.plugin.sentiment)
> library(twitteR) # provides searchTwitter(); assumes the Twitter connection is already set up
> trump_tweets <- searchTwitter("trump", n = 10000, lang = "en") # up to 10,000 English tweets
> trump_texts = sapply(trump_tweets, function(x) x$getText()) # keep only the tweet text
If you run “trump_texts” in R, these are some of the lines you would get as a result:
[15] “RT @telesurenglish: Warren Buffett challenges Donald Trump to reveal his tax return https://t.co/1t2ZNRcBRx”
[16] “Traitor @fhollande disgusted by @realDonaldTrump. Would be bad for Trump, except 80% French people would probably pay to spit on @fhollande.”
[17] “RT @DuncanHosie: Meg Whitman: Trump is a \”dangerous\” and \”dishonest demagogue.\” All other Republicans have a moral obligation to similarly…”
[18] “RT @juliussharpe: The Trump campaign reminds me of how I used to try to get people to dump me so I wouldn’t have to break up with them.”
[19] “Donald Trump’s Self-Created Pothole – Washington Wire – WSJ https://t.co/A3HkKFCsCS”
[20] “RT @JeffreyGoldberg: This is amazing. \”Republican lawmakers and strategists have begun to entertain abandoning him en masse.\” https://t.co/…”
The following code will help you trim unwanted or redundant characters from the 10,000 tweets you extracted from Twitter:
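> trump_texts = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", trump_texts) #Removes retweet headers such as "RT @user"
> trump_texts = gsub("@\\w+", "", trump_texts) #Removes remaining @mentions
> trump_texts = gsub("[[:punct:]]", "", trump_texts) #Removes punctuation
> trump_texts = gsub("[[:digit:]]", "", trump_texts) #Removes digits
> trump_texts = gsub("http\\w+", "", trump_texts) #Removes what is left of links
> trump_texts = gsub("[ \t]{2,}", " ", trump_texts) #Collapses runs of spaces/tabs into one space so words are not glued together
> trump_texts = gsub("^\\s+|\\s+$", "", trump_texts) #Trims leading and trailing whitespace
> trump_texts = trump_texts[!is.na(trump_texts)] #Drops missing entries
> names(trump_texts) = NULL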
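To see what these cleaning steps do to a single tweet, here is a minimal sketch that wraps them in one function and applies it to one of the sample tweets shown earlier. The function name clean_tweet is my own, made up for illustration; it is not part of any package. I pass perl = TRUE on the first pattern so the non-capturing group is handled the same way on every setup.

```r
# Hypothetical helper (not from any package) that applies the cleaning steps in order.
clean_tweet <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x, perl = TRUE)  # retweet headers
  x <- gsub("@\\w+", "", x)          # remaining @mentions
  x <- gsub("[[:punct:]]", "", x)    # punctuation (this also flattens links)
  x <- gsub("[[:digit:]]", "", x)    # digits
  x <- gsub("http\\w+", "", x)       # what is left of links
  x <- gsub("[ \t]{2,}", " ", x)     # collapse runs of spaces/tabs
  x <- gsub("^\\s+|\\s+$", "", x)    # trim both ends
  x
}

sample_tweet <- "RT @telesurenglish: Warren Buffett challenges Donald Trump to reveal his tax return https://t.co/1t2ZNRcBRx"
cleaned <- clean_tweet(sample_tweet)
print(cleaned)
```

Note that the order matters: the link is only removed by the "http\\w+" pattern because the punctuation step has already squashed it into a single run of letters and digits.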
Next, you have to run the “Corpus” and “score” functions to turn those tweets into polarity scores. In other words, you are telling R to read the texts and estimate the sentiment behind them!
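> text.corpus <- Corpus(VectorSource(trump_texts)) # build a tm corpus from the cleaned tweets
> tm_tag_score <- tm_term_score # workaround: newer versions of tm renamed tm_tag_score to tm_term_score
> install.packages("devtools")
> library(devtools)
> install_github("mannau/tm.plugin.sentiment") # GitHub version of the sentiment plugin
> text.corpus <- score(text.corpus) # attach sentiment scores to each tweet
> meta(text.corpus) # print the scores
> ncorpus <- meta(text.corpus) # keep the scores in their own object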
Just a quick refresher on the polarity levels reported by the sentiment scoring. Polarity values of “1” and “-1” mean an extremely positive view and an extremely negative view, respectively. A value of “0” means indifferent. Values between “-1” and “0” and between “0” and “1” represent milder opinions that fall somewhere between indifferent and extreme.
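To make the polarity number more concrete, here is a minimal sketch of how such a score can be computed. It assumes polarity is defined as (p - n) / (p + n), where p and n are the counts of positive and negative words in a tweet; I believe this matches the definition used by tm.plugin.sentiment, but please treat that as an assumption. The two tiny word lists and the function polarity_sketch are made up for illustration; the real analysis relies on the General Inquirer dictionaries.

```r
# Hypothetical mini-lexicons, for illustration only.
positive_words <- c("amazing", "good", "great")
negative_words <- c("bad", "dangerous", "dishonest")

# Assumed definition: (p - n) / (p + n), with p/n = positive/negative word counts.
polarity_sketch <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]  # split on anything that is not a letter
  p <- sum(words %in% positive_words)
  n <- sum(words %in% negative_words)
  if (p + n == 0) return(NA_real_)  # my choice here: no opinion words, no score
  (p - n) / (p + n)
}

all_positive <- polarity_sketch("This is amazing")
all_negative <- polarity_sketch("a dangerous and dishonest demagogue")
balanced     <- polarity_sketch("good idea but bad timing")
print(c(all_positive, all_negative, balanced))
```

Under this definition a tweet only reaches 1 or -1 when every opinion word it contains points the same way, which is why mixed tweets land somewhere in between.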
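Now you can run the ggplot function to show the polarity results on a histogram. The call below assumes the polarity scores have been collected into a data frame called “sentiment_result” with a column named “values”; the code that builds it is not shown in this post.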
> library(ggplot2); library(scales) # scales provides pretty_breaks()
> ggplot(sentiment_result, aes(x = values)) + geom_histogram(col = "orange", fill = "green", alpha = 0.2) + xlab("Polarity") + ylab("Count") + ggtitle("Histogram for Polarity") + scale_y_continuous(breaks = pretty_breaks(n = 10), limits = c(0, 1000))
Looking at this histogram, some interesting insights I obtained are: 1. The general public’s views on Trump do not seem as extreme and polarized as the media depicts. 2. The proportion of social media users with slightly positive views toward Trump (polarity between 0 and 0.5) is a little higher (roughly 18% higher) than the proportion with slightly negative views (polarity between -0.5 and 0). However, we have to interpret these results with caution: although 10,000 tweets may be good enough for demonstration purposes, the sample may not be highly representative of the true population, since the total number of tweets related to Trump is far larger than that. My next step would be to run the sentiment analysis on a larger sample.
Now comes the final step: word cloud. ^_^ Before calling the “wordcloud” function, you want to minimize displaying words that do not convey any opinion (e.g. words like “you”, “me”, “I”, “them” appear all the time in tweets but do not represent user opinions). The “tm_map” function does just that:
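For readers who want to reproduce this kind of comparison, here is a minimal sketch of one way to compute the two shares and the relative gap between them. The scores vector is made up for illustration and is not my actual result; the variable names are hypothetical too.

```r
# Made-up polarity scores, for illustration only (NOT the real results).
scores <- c(-1, -0.4, -0.2, 0, 0.1, 0.3, 0.3, 0.5, 1, NA)
scores <- scores[!is.na(scores)]  # tweets without a score are dropped

# Share of tweets in each "slight" band.
slightly_positive <- mean(scores > 0 & scores <= 0.5)
slightly_negative <- mean(scores < 0 & scores >= -0.5)

# Relative gap: how much larger the positive share is than the negative one.
relative_gap <- (slightly_positive - slightly_negative) / slightly_negative
print(c(slightly_positive, slightly_negative, relative_gap))
```

With real data, the same three lines can be run on the polarity column of the scored corpus.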
> ncorpus <- tm_map(text.corpus, removeWords, c("to", "me", "you", "he", "our", "the", "this", "was", "that", "have", "not", "are", "from", "and", "for", "i", "with", "when", "should", "which", "what", "how", "trump", "but"))
Finally comes the word cloud:
> library(wordcloud) # also loads RColorBrewer, which provides brewer.pal()
> wordcloud(ncorpus, scale = c(3, 0.5), max.words = 70, random.order = FALSE, colors = brewer.pal(12, "Paired"))
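A word cloud sizes each word by how often it appears, so it can help to look at the underlying counts directly. The following is a minimal base-R sketch of that idea, using three made-up, already-cleaned tweets and a short subset of the stop-word list above; it only illustrates the counting and is not what the wordcloud function runs internally.

```r
# Made-up, already-cleaned tweets, for illustration only.
texts <- c("trump is amazing",
           "the trump campaign is dangerous",
           "amazing campaign")
stop_words <- c("the", "is", "trump")  # short subset of the list used above

# Split into words, drop stop words and empty strings, then count.
words <- unlist(strsplit(tolower(texts), "\\s+"))
words <- words[!(words %in% stop_words) & nzchar(words)]
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The words with the highest counts in freq are the ones that would be drawn largest in the cloud.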
This is all for today. I hope you enjoyed reading this post and have become more familiar with sentiment analysis using R. Please feel free to leave any comments or questions. See you all next time! 😀