K-Nearest Neighbors Algorithm (Supervised Machine Learning) using R

In this blog I would like to introduce a supervised machine learning tool called K-Nearest Neighbors (KNN) algorithm. The algorithm “studies” a given set of training data and their categories in an attempt to correctly classify new instances into different categories. KNN can be applied to classification or regression. This blog will only talk about classification…

Statistical Analysis on Data Frames with Python and R

In this article I will talk about Python and R codes used to strip null values and convert csv data tables to data frames for the purpose of running practical statistical analysis smoothly. The key statistical methods discussed will be Mann-Whitney-Wilcoxon Rank Sum test and one-way ANOVA. While there are plenty of articles on the…

Text Mining – Sentiment Analysis on Autonomous Driving

Hello All! In this blog I will walk you through doing sentiment analysis through Twitter data using R. Sentiment analysis is the study of people’s views on a topic through their participation in social media and how those sentiments affect a related object or phenomenon. Some examples of sentiment analysis include: studying Twitter users’ comments…

Handling Big Data with R: The Efficient Ways

As information technology advances and globalization of businesses progresses, the need to efficiently manage large and complex data becomes more critical. While many are aware of the growing importance of big data, professionals who are not directly involved with data analysis might not have experience handling large raw datasets, let alone the ability to decipher…

Building Interactive Website using R and HTML + Data Interpretation

Link to my interactive analytics site:


Many people who have used R extensively for statistics and data visualization are impressed by its deep customizability and mathematical prowess. In recent years, R has quickly gained popularity at the expense of Matlab and SAS [1]. One reason is that, of the three, R is the only one that is both open source (SAS University Edition is also free, but lacks some features of full-version SAS) and backed by many online forums dedicated to answering code-related questions. Even for data science concepts such as machine learning, R usually has the right tools for you [2]. Building on these advantages, I have created a simple interactive website on auto sales analysis using R and HTML code. The site displays different types of statistical charts based on the reader’s selections, so readers can quickly see Volvo cars’ sales volume broken down by year, country and model, as well as a volume comparison with other premium automakers and some analytical insights at the end. This blog briefly describes the fundamentals of making an interactive website using R and HTML and interprets the charts presented on the site.

Fundamentals of Shiny application and interactive site:

First of all, my interactive site is divided into 6 sections: main topic, raw data, sales volume by brand, sales volume by country and model, sales growth/decline in major markets and analytical insights. Looking at my URL, you will see the keyword “shinyapps.io”. That’s important to note because Shiny (an R package) is the building block of interactive applications in R. To get started, install the Shiny package in R. Then, in your R working directory, create a folder to hold the raw data files (preferably in CSV format), R code files and pictures that pertain to the webpage you want to make. When that’s done, create 2 R script files (in the script editor, the area above the Console window) called server.R and ui.R. These 2 files store the code for your webpage: ui.R dictates the titles, styles and general layout of your site, whereas server.R contains the R code that creates the actual graphs and plots. HTML code can also be included in both files.

Now, if you have never created interactive websites before, like myself, you might already be intimidated by the steps and details I have given so far. Don’t be! Writing R code for interactive objects in ui.R and server.R is actually not that difficult if you are willing to invest some time learning Shiny and simple HTML, use the R coding skills you have gathered so far and think creatively about how to combine R functions into a coherent whole. Tutorials for HTML can be found at http://www.w3schools.com/html/. As for Shiny, http://shiny.rstudio.com/tutorial/ was my best teacher.
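To make the ui.R / server.R split concrete, here is a minimal sketch of a Shiny app. It is not the code behind my site: the input id "Country", the output id "salesPlot", the column names and the numbers are all placeholders chosen for illustration, and both parts are kept in one script for brevity (in a two-file layout, the `ui` object would live in ui.R and the `server` function in server.R).

```r
library(shiny)

# ui part: titles, layout and the interactive input widgets.
ui <- fluidPage(
  titlePanel("Sales volume by country (illustrative)"),
  sidebarLayout(
    sidebarPanel(
      # Hypothetical input id "Country"; the choices match the data columns below.
      selectInput("Country", "Country:", choices = c("Sweden", "China"))
    ),
    mainPanel(
      # HTML elements can be written with the tags helpers.
      tags$p("Pick a country to redraw the chart."),
      plotOutput("salesPlot")
    )
  )
)

# server part: the R code that builds the actual chart.
server <- function(input, output) {
  # Placeholder numbers, not actual sales figures.
  sales <- data.frame(
    Year   = 2011:2015,
    Sweden = c(50, 52, 51, 55, 60),
    China  = c(40, 45, 55, 70, 80)
  )
  output$salesPlot <- renderPlot({
    # input$Country holds the user's current selection as a string.
    plot(sales$Year, sales[[input$Country]], type = "b",
         xlab = "Year", ylab = "Units sold (placeholder)",
         main = input$Country)
  })
}

# Build the app object; launch it locally with runApp(app).
app <- shinyApp(ui = ui, server = server)
```

Assigning the result of `shinyApp()` to a variable keeps the script from launching the app as a side effect; calling `runApp(app)` starts it when you are ready.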

Let me show an example of R Shiny code here. The 5th section of my interactive site (titled “Volvo sales growth (or decline) by major market”) displays the “country sales volume” line chart based on the country the user selects. The user selection syntax is embedded in the ui.R file (remember, ui.R creates the interactive objects in a Shiny application), whereas the chart creation code is stored in the server.R file. Below is the code used to generate the “country sales volume” line charts:



In this code you can see the typical ggplot function combined with helper functions like “aes_string”, “geom_text” and “ggtitle”. Within the “aes_string” call there’s a part that reads “y=input$Country”. This part tells Shiny to pick up whatever the user selects as Country and apply it as the y-axis variable. So if you select Sweden, Shiny reads the numeric values under the Sweden column and uses them as the sales volumes on the y-axis. To encourage self-learning and creativity, I will not include the rest of my code in this blog. If you are interested in creating an R Shiny site, I would highly recommend this tutorial site to you. With determination and willingness to think outside the box, you’ll be able to create a quality interactive site that is on par with, if not better than, mine.
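For readers who want something to experiment with, here is a small sketch of the idea of mapping a user-selected column name to the y-axis. It is an illustration under assumptions, not the site’s actual code: the data frame, its column names and its values are placeholders, and it uses ggplot2’s `.data[[...]]` pronoun instead of `aes_string`, since newer ggplot2 versions deprecate `aes_string` in favor of that form.

```r
library(ggplot2)

# Placeholder data: one column per country, not actual sales figures.
sales <- data.frame(
  Year   = 2011:2015,
  Sweden = c(50, 52, 51, 55, 60),
  China  = c(40, 45, 55, 70, 80)
)

# country is a string such as "Sweden"; in a Shiny server function
# it would be supplied by the user's selection, e.g. input$Country.
country_sales_plot <- function(data, country) {
  ggplot(data, aes(x = Year, y = .data[[country]])) +
    geom_line() +
    geom_point() +
    # Print each year's value just above its point.
    geom_text(aes(label = .data[[country]]), vjust = -1) +
    ggtitle(paste(country, "sales volume")) +
    ylab("Units sold (placeholder)")
}

p <- country_sales_plot(sales, "Sweden")
```

Inside a Shiny server function this could be wired up as `output$salesPlot <- renderPlot(country_sales_plot(sales, input$Country))`, where `salesPlot` is again a hypothetical output id.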

Background of analysis and interpretations of charts:

Now that I’ve explained the fundamentals and structure of my R Shiny website, let’s talk about the purpose of the analysis and the actual data and charts embedded in the website. The goal of the analysis is to forecast Volvo cars’ 5-year sales volumes in 2016-2020. I selected Volvo for this analysis for 3 reasons: 1. It is the world’s 5th largest premium automaker in terms of sales volume. 2. It experienced much lower growth than the other 5 best-selling premium auto brands over the past 16 years, yet it has been able to maintain its position. 3. It’s aiming for an astonishing 60% sales growth over the next 5 years (from 500,000 units in 2015 to 800,000 units in 2020)! All the charts are based on 6 raw datasets I included right underneath the title section. They are self-explanatory and you can download each set via the Download button.

Because the charts in the first 5 sections (from “Main Topic” to “Sales growth/decline in major markets”) are easy to understand, I will focus on explaining the charts in the “Analytical Insights” section.


Profit-volume scatterplot: First off we have the profit-volume scatterplot. As its name implies, the chart plots sales volume on the x axis and profit margin on the y axis. The point is to check whether sales volume and profit margin move in the same direction. Keep in mind that in the business world these 2 metrics do not always move together: when the average selling price of the products falls by a larger magnitude than sales volume grows, profit margin may go down even as sales volume goes up. (Toyota Motor, Samsung and Apple have all experienced this at some point in the past.) Looking at the positions of the points and the blue linear trend line in the scatterplot, we can confirm that in Volvo’s case, higher sales do go together with a higher profit margin, which is certainly a positive sign for the automaker.
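The “same direction” check can be sketched in a few lines of base R: fit a straight line of margin on volume and look at the sign of the slope and of the correlation. The numbers below are placeholders, not Volvo’s figures, and the variable names are my own for illustration.

```r
# Placeholder data, one pair per year (not actual Volvo figures).
volume <- c(400, 420, 450, 470, 500)   # sales volume, thousand units
margin <- c(2.0, 2.5, 2.4, 3.1, 3.6)   # profit margin, percent

# Linear trend line: margin = intercept + slope * volume.
fit   <- lm(margin ~ volume)
slope <- coef(fit)[["volume"]]

# Pearson correlation between the two metrics.
r <- cor(volume, margin)

# A positive slope and correlation mean the metrics move together.
moves_together <- slope > 0 && r > 0

# To draw the scatterplot with a blue trend line:
# plot(volume, margin, xlab = "Sales volume", ylab = "Profit margin (%)")
# abline(fit, col = "blue")
```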


Sales growth/decline chart: Moving on to the “Sales growth/decline chart” yields a slightly different outlook. Here you can clearly see that Volvo experienced a lot of ups and downs between 1992 and 2015. While sales volume grew by about 20% in 2011, it dropped by more than 18% in 2008, perhaps due to the global financial crisis. Even worse, Volvo experienced a sales decline in 7 out of 17 years over the last 24 years. That’s an alarming 41%! This volatility in volume is an important reason for Volvo’s modest growth in the previous 16 years, compared to the stellar growth of its competitors (e.g. Lexus and Land Rover).
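The quantities in a growth/decline chart are simple to compute. As a minimal sketch, assuming one sales figure per year stored in a numeric vector (the values below are placeholders, and `yoy_growth` is a helper name made up for this example):

```r
# Year-over-year growth in percent:
# (this year - last year) / last year * 100.
yoy_growth <- function(volume) {
  diff(volume) / head(volume, -1) * 100
}

# Placeholder volumes, one value per year (not actual sales figures).
volume <- c(100, 120, 90, 95, 110, 105)
growth <- yoy_growth(volume)

# Share of years (in percent) in which sales fell.
decline_share <- mean(growth < 0) * 100

# The 41% quoted above is 7 declining years out of 17.
quoted_share <- round(7 / 17 * 100)
```

Note that a series of n yearly volumes yields n - 1 growth rates, because the first year has no predecessor to compare against.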


Sales process engineering chart: The name comes from the sales process engineering concept attributed to the prominent engineer W. Edwards Deming. If you’ve read my previous blog on Statistical Process Control (SPC), you will see a strong resemblance between this chart and the manufacturing SPC chart. Sales process engineering is essentially the application of SPC to sales analysis. It was first used to gauge the stability and trend of salespeople’s performance on a weekly basis. Here, instead of putting weeks on the x axis, I’ve used years. A rule of thumb for SPC is that we should have at least 25 observations. That criterion is met since I have sales volume data for exactly 25 years.

Looking at the chart, a red point lies on or outside the upper or lower control limit lines (UCL and LCL), whereas a yellow point violates a Western Electric rule. In our chart the yellow points do not indicate a problem, since there is no obvious deviation of the process and the points are only present in a small section of the chart. While in manufacturing SPC we want observations to be inside the UCL and LCL lines and stay as close to the CL (center line) as possible, in sales SPC the ideal situation is to have some points above the UCL line. The rationale is simple: a stable sales process with some above-norm sales volume represents potential for outstanding growth in the future. In Volvo’s case, the last 3 years have shown robust growth, with 2015 protruding above the UCL line with more than 500,000 units sold. Volvo’s recent upturn is due to the success of the all-new XC90 SUV. The vehicle has won numerous awards and praise from auto enthusiasts throughout the world and recently became the third best-selling model in Volvo’s lineup.
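To show where the CL, UCL and LCL lines come from, here is a rough base R sketch of one common way to compute them for single yearly observations: an individuals chart, where sigma is estimated from the average moving range divided by 1.128. This is an assumption for illustration; the chart on the site may have been built with a different chart type or package, and the data below are placeholders.

```r
# Control limits for an individuals (XmR) chart.
# Assumption: sigma is estimated as mean moving range / 1.128,
# where 1.128 is the d2 constant for subgroups of size 2.
control_limits <- function(x) {
  center       <- mean(x)
  moving_range <- abs(diff(x))
  sigma        <- mean(moving_range) / 1.128
  ucl          <- center + 3 * sigma
  lcl          <- center - 3 * sigma
  list(
    center  = center,
    ucl     = ucl,
    lcl     = lcl,
    # Positions of points on or outside the limits (the "red" points).
    outside = which(x >= ucl | x <= lcl)
  )
}

# Placeholder series with one unusually high final observation.
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 30)
limits <- control_limits(x)
```

In a sales setting, a point flagged above the UCL, like the last one here, is the kind of above-norm result described above rather than a defect to be corrected.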


Time series forecast chart: In this chart the black dots represent historical sales volume. Green dots represent the forecast using the Holt-Winters smoothing model. Yellow dots represent the forecast from the exponential smoothing state space model, whereas red dots indicate a simple ARIMA time series forecast. I’ve already discussed ARIMA forecasting in one of my previous blogs; it’s one of the simpler time series forecast models. The p, d and q terms for my ARIMA forecast were generated by R’s auto.arima function and they are 0, 1, 0 respectively. The ARIMA forecast results are very close to 500,000 units, meaning ARIMA doesn’t think Volvo sales will grow by much. That is expected: an ARIMA(0,1,0) model with no drift term is a random walk, so its forecast essentially carries the last observed value forward. While ARIMA is simple, it doesn’t deal with seasonality and trend as well as Holt-Winters does. Looking at the interval from 1991 to 2015, it is clear that there’s an upward trend. Thus it is more reliable to refer to the Holt-Winters method in this case. The Holt-Winters forecast results are much higher than the ARIMA ones, with an estimated 749,333 units sold in 2020 (about 50% growth from 2015).

With regard to the exponential smoothing state space model, it yields results that fall between ARIMA and Holt-Winters. Because the state space model is highly complex and resources on it are limited, I am still in the process of learning it; I included it here because it’s one of the more advanced time series methods. I will update this section once I’ve gained more knowledge about the model. Until then, we can rely on Holt-Winters since it’s a reliable, objective model that adjusts for trend and seasonality.
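The contrast between the two forecasts can be reproduced on toy data with base R alone. The sketch below is illustrative and rests on several assumptions: the series is made up, `stats::arima` with order (0, 1, 0) stands in for the model that auto.arima selected, Holt-Winters is fitted without a seasonal component because the data are annual, and the smoothing parameters are fixed by hand so the example is reproducible (leave `alpha` and `beta` out to let R estimate them). The state space model is not covered here.

```r
# Placeholder annual volumes (thousand units), not actual sales figures.
sales <- ts(c(100, 104, 103, 110, 115, 113, 120, 126, 131, 140),
            start = 2006)

# ARIMA(0,1,0) without drift is a random walk:
# every forecast step repeats the last observed value.
rw_fit <- arima(sales, order = c(0, 1, 0))
rw_fc  <- as.numeric(predict(rw_fit, n.ahead = 5)$pred)

# Holt-Winters with a trend but no seasonal component (gamma = FALSE).
# alpha and beta are hand-picked illustrative values.
hw_fit <- HoltWinters(sales, alpha = 0.5, beta = 0.3, gamma = FALSE)
hw_fc  <- as.numeric(predict(hw_fit, n.ahead = 5))

# Percentage growth between two volumes.
pct_growth <- function(from, to) (to - from) / from * 100

# Growth figures quoted in this post, starting from 500,000 units in 2015.
target_growth <- pct_growth(500000, 800000)   # Volvo's 2020 goal
hw_growth     <- pct_growth(500000, 749333)   # Holt-Winters 2020 estimate
```

On a series with an upward trend, `rw_fc` stays flat at the last observation while `hw_fc` keeps climbing, which mirrors the gap between the red and green dots in the chart.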

So how likely is Volvo to achieve its 800,000-unit sales goal in 2020? My opinion is that Volvo is being a little too ambitious with its target. While Volvo is releasing the all-new S90 midsize sedan, V90 wagon and at least 6 new vehicles by 2019, by that time Volvo’s lineup (9 vehicles) will still pale in comparison with the German automakers, which all have at least 14 vehicle models now. In addition, the Germans as well as Lexus already have a lot of new or replacement products in the pipeline and they are currently selling considerably more than Volvo. On the positive side, Volvo’s new design language is favored by many, and its first-class safety standards as well as advanced autonomous driving tech could bode very well for the automaker. All the aforementioned factors lead me to believe the 50% growth rate forecast by the Holt-Winters model is more attainable and realistic than the 800,000-unit goal. Feel free to share your thoughts on this analysis and/or my interactive website in the comments section! 🙂

1. http://www.burtchworks.com/2015/05/21/2015-sas-vs-r-survey-results/

http://blog.revolutionanalytics.com/2015/11/new-surveys-show-continued-popularity-of-r.html

2. https://www.datacamp.com/community/tutorials/machine-learning-in-r

K-Means Clustering – Unsupervised Machine Learning in Biotech

As I’ve promised in my previous blog, I will share in this blog my experience applying K-means clustering to some tasks in my workplace. K-means is a form of unsupervised machine learning. Let me explain what supervised machine learning and unsupervised machine learning are. In supervised machine learning, you are given a set of inputs…

Time Series Forecast-A Supply Chain Approach Using R and SAS

It’s been 5 months since I finished my Data Analyst contract with the Supply Chain team at Google. After reading an interesting article about supply chain planning in Wall Street Journal, I’ve decided to write this blog to highlight some new observations I’ve made on return quantity forecasts using R’s time series. During the project…