It’s been 5 months since I finished my Data Analyst contract with the Supply Chain team at Google. After reading an interesting article about supply chain planning in The Wall Street Journal, I decided to write this blog post to highlight some observations I’ve made on return quantity forecasts using time series models in R.
During the project with Google I was tasked with data cleaning and streamlining the forecast models for return quantities of certain Android devices. When I first joined, the primary model the team used to forecast a given month’s return rate was the supply chain moving average method (somewhat similar to the moving average method used in inventory accounting, but much more complicated). The model is fitted within an Excel workbook containing the product’s historical return rates, and it was developed by a seasonal consultant who was no longer on the project when I joined. While the model had a satisfactory accuracy rate, I found it time-consuming and error-prone to execute: every month, the stakeholders who run the model have to make a lot of manual updates (including revising cell references within formulas), and there are nearly 10 products (each with 4 or more variants) that require a forecast update each month. The longer a product has been on the market, the more time it takes to update the model, and a small error in the revision can throw off the whole model and render it inaccurate.
When a product is relatively new to the market and little information about return quantity is available, this supply chain model is the ideal choice, because it uses a legacy product (one with properties similar to the product being forecasted) as an estimation tool, which solves the lack-of-historical-data problem. But for products that have been on the market for a while and have a life cycle of 3-4 years, the default supply chain model requires much more time and effort. Because of this, I started thinking about practical alternatives to the supply chain model for long-life-cycle products, and here the story begins.
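For readers unfamiliar with the baseline, the core arithmetic of a moving-average forecast can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the much more involved Excel model; the window length of 3 and the numbers are my own illustrative assumptions, not the team’s actual settings.

```python
def moving_average_forecast(rates, window=3):
    """Forecast the next period's return rate as the mean of the
    last `window` observed rates (a naive moving-average baseline)."""
    if len(rates) < window:
        raise ValueError("need at least `window` observations")
    return sum(rates[-window:]) / window

# Example: three hypothetical recent monthly return rates
history = [0.10, 0.10, 0.11]
print(round(moving_average_forecast(history), 4))  # 0.1033
```

The real workbook layered product-specific adjustments on top of an average like this, which is exactly where the manual cell-reference updates came in.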
Aside from the supply chain moving average model, multivariate regression, time series and other more complicated algorithms are all possible alternatives. A regression model sounds good in theory, but it is challenging to implement here because we are not sure of the primary determinants of inventory returns (price and rebate policies might be factors, but what else?). After ruling out the other options, time series was the choice left for consideration. Used extensively in financial and scientific settings (e.g. GDP prediction, forecasting chemical changes), time series has not been very popular in the logistics industry. But as a passionate data analyst, I decided to give it a shot with the data available (in this case, return rates for the past 24 months). Because I wasn’t sure whether there was seasonality in the return pattern, I decided to apply the conservative autoregressive integrated moving average (ARIMA) model.
Right below this paragraph are the return quantities for the first 24 months of the product’s life cycle. The product is expected to remain on the market for at least 36 months, and I had to make the model predict the return rates for the upcoming 3 months. I have altered the numbers slightly for confidentiality, but the idea is to show how nicely the ARIMA time series model works with supply chain data.
The return rate is the return quantity divided by the sales quantity, from July 2013 through June 2015. The goal was to forecast the return rate from July 2015 through September 2015. Respecting the confidentiality of Google data, I won’t disclose the product name. I will use R to run the time series analysis.
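As a concrete illustration of that definition (the quantities below are made up for this post, not Google data):

```python
# Return rate = return quantity / sales quantity, computed per month.
returns = [120, 95, 130]     # hypothetical monthly return quantities
sales = [1000, 950, 1000]    # hypothetical monthly sales quantities

return_rates = [r / s for r, s in zip(returns, sales)]
print([round(x, 3) for x in return_rates])  # [0.12, 0.1, 0.13]
```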
Model description and R code:
To run the ARIMA (autoregressive integrated moving average) model, we first need to supply the 3 parameters it requires: p (the autoregressive order), d (the degree of differencing) and q (the moving-average order). Because the analysis requires a stationary series (constant mean, variance and autocorrelation over time), the d term is the number of non-seasonal differences needed to make the series stationary. When I tested the optimal parameters in R as shown below, the software suggested the combination (0, 1, 0): no autoregressive terms, one non-seasonal difference to transform the data into a stationary set, and no moving-average terms.
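The differencing step itself is simple arithmetic. A minimal Python sketch (with made-up numbers) shows what d = 1 does to a trending series:

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y[t] - y[t-1]."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

trend = [10, 12, 15, 19, 24]   # steadily rising: not stationary
print(difference(trend))       # [2, 3, 4, 5] -- the level trend is removed
print(difference(trend, d=2))  # [1, 1, 1]    -- a second pass flattens it fully
```

Each difference removes one layer of trend; ARIMA fits its AR and MA terms to the differenced series and then undoes the differencing when forecasting.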
To check this result, we can tell R to plot the sample autocorrelation function (ACF, which helps identify the moving-average order q) and the partial autocorrelation function (PACF, which helps identify the autoregressive order p).
Moving-average (ACF) chart:
As you can see from both charts, the autocorrelations are for the most part within the significance limits: only the first and second lags exceed the upper limit on the autoregressive (PACF) chart, and the first lag slightly exceeds the upper limit on the moving-average (ACF) chart. Thus, we can trust R’s parameter suggestions in this case.
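Under the hood, the sample ACF at lag k and the approximate ±1.96/√n significance bounds those charts draw can be computed from scratch. This Python sketch is my own illustration (with hypothetical data), not the exact routine R uses:

```python
import math

def sample_acf(series, max_lag):
    """Sample autocorrelation r_k = sum((y_t - m)(y_{t+k} - m)) / sum((y_t - m)^2)."""
    n = len(series)
    m = sum(series) / n
    denom = sum((y - m) ** 2 for y in series)
    acf = []
    for k in range(1, max_lag + 1):
        num = sum((series[t] - m) * (series[t + k] - m) for t in range(n - k))
        acf.append(num / denom)
    return acf

data = [0.12, 0.11, 0.13, 0.10, 0.12, 0.11, 0.14, 0.10,
        0.12, 0.11, 0.13, 0.10]          # hypothetical monthly return rates
bound = 1.96 / math.sqrt(len(data))      # approximate 95% significance limit
for k, r in enumerate(sample_acf(data, 4), start=1):
    flag = "significant" if abs(r) > bound else "within limits"
    print(f"lag {k}: r = {r:+.3f} ({flag})")
```

Lags whose bars cross the bound are the ones that would argue for nonzero p or q terms.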
Now I will plot the actual time series chart from the given data. The syntax I wrote in R to generate the chart was:
>plot(as.xts(basis), major.format="%Y-%m", main="Monthly Return Rates", xlab="Time", ylab="Return Rate", las=1, ylim=c(0, 0.20))
Note that the first step, installing and loading the “xts” package, is important because it lets me plot the time series data in a neat format. “nforecast” is the data set that contains the 24-month return rates. Once these commands were entered, R generated the chart below:
After running the ARIMA syntax with the 3 system-suggested parameter values, I arrived at the forecast results for July 2015-September 2015:
Based on the ARIMA forecast, the return rates for July 2015-September 2015 were all 5%, but with different 80% and 95% confidence intervals. This makes sense: looking back at the July-September periods of 2013 and 2014, the rates were 0.11, 0.11, 0.13 in 2013 and 0.10, 0.10, 0.11 in 2014, so return rates were very consistent for those months in both years. Looking at the original time series chart, we can also spot a mild seasonal pattern across those two years: July through October tend to have the highest return rates, whereas March through May have the lowest. Interestingly, the July-October window coincides with many major product release dates of US-based technology companies, so direct competitors introducing compelling products at comparable prices could be one reason return rates run higher in those periods. The pattern is not strong, though, or R would likely have suggested seasonal terms on top of the single non-seasonal difference (d = 1).
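This flat forecast is no accident: an ARIMA(0,1,0) model is a random walk, so every point forecast equals the last observed value, and the prediction interval widens with the square root of the horizon. The following Python sketch reproduces that arithmetic on made-up numbers (using the standard normal quantiles 1.2816 and 1.96 for the 80% and 95% intervals; R’s exact interval may differ slightly depending on how it estimates the residual variance):

```python
import math

def random_walk_forecast(series, horizon=3):
    """Point forecasts and 80%/95% intervals for an ARIMA(0,1,0) model.

    The point forecast is the last observation; the h-step-ahead standard
    error is sigma * sqrt(h), where sigma is the sample standard deviation
    of the first differences.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    mean_d = sum(diffs) / len(diffs)
    sigma = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1))
    last = series[-1]
    out = []
    for h in range(1, horizon + 1):
        se = sigma * math.sqrt(h)
        out.append({
            "forecast": last,
            "lo80": last - 1.2816 * se, "hi80": last + 1.2816 * se,
            "lo95": last - 1.96 * se, "hi95": last + 1.96 * se,
        })
    return out

rates = [0.11, 0.11, 0.13, 0.10, 0.10, 0.11, 0.09, 0.08,
         0.07, 0.06, 0.06, 0.05]         # hypothetical monthly rates
for h, f in enumerate(random_walk_forecast(rates), start=1):
    print(f"h={h}: {f['forecast']:.3f} "
          f"[80%: {f['lo80']:.3f}, {f['hi80']:.3f}] "
          f"[95%: {f['lo95']:.3f}, {f['hi95']:.3f}]")
```

The identical point forecasts with progressively wider intervals match the shape of the R output described above.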
When using the Holt-Winters forecast, which is a bit more advanced than ARIMA and arguably more reliable when seasonality and trend are present, the results were smaller but still highly consistent across the 3-month period. The R code and output for the Holt-Winters model are as follows:
Some of my readers might be business analytics professionals who primarily use SAS instead of R. The good news is that the ARIMA model can be run in SAS effectively as well! In SAS Studio (or SAS Enterprise Guide, depending on which version you use; the syntax is the same regardless), simply run the following syntax, which imports the Excel sheet into SAS and then runs the ARIMA procedure, and you will get the subsequent data summary, autoregressive and moving-average term charts:
As you can see here, the autoregressive and moving-average term charts are identical to those we obtained from R; the only stylistic difference is that SAS draws the bars bolder and thicker. Although the ACF and PACF charts show no obvious spikes (meaning the p and q terms can be set to 0), the time series chart appears somewhat nonstationary, so we need to find the differencing term that makes the data stationary. This has to be done by hand because, unlike R, SAS does not have an auto-ARIMA function that calculates the 3 parameters for you, so you have to do some manual estimation yourself.
Let’s test the stationarity of the data with a differencing term of 1. To do that, simply rerun the previous syntax but put a “(1)” after the variable name, as follows:
As you can see from the new charts, the inverse autocorrelation function (IACF) dies out quickly, which indicates a stationary series. The new time series chart also shows a recurring pattern around a constant level, a result of the stationary transformation. Now we know d=1 is a good parameter for this ARIMA run, along with p=0 and q=0.
Run the following syntax to arrive at the forecast results:
SAS output table: Forecasts for variable Return_Rate (columns: Obs, Forecast, Std Error, 95% Confidence Limits)
As you can see, the forecast results from SAS are almost identical to the results from R; the tiny difference is likely due to rounding rather than a calculation difference. So there you have it: ARIMA time series forecasting in both R and SAS.
Before concluding that ARIMA was the optimal forecast method in this scenario, I did some research on the web to check whether any method was superior to ARIMA. I found a related, simpler family of techniques based on exponential smoothing (ESS). To be on the safe side and not limit my options, I also ran the forecast using this less complicated method. It is simpler to implement since there are no ARIMA parameters to choose, but the lack of a stationary transformation, among other mechanisms, means accuracy may suffer slightly. As seen below, the results from ESS were 0.075, 0.074 and 0.086, slightly higher than the results from ARIMA.
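For reference, the simplest member of the exponential smoothing family fits in a few lines. This is a flat-forecast sketch with a fixed smoothing weight of my own choosing; real implementations estimate the weight from the data:

```python
def simple_exp_smoothing(y, alpha=0.3, horizon=3):
    """Simple exponential smoothing: level = alpha*y_t + (1-alpha)*level.

    Produces a flat forecast: the final smoothed level repeated
    `horizon` times. `alpha` controls how fast old data is forgotten.
    """
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return [level] * horizon

rates = [0.11, 0.11, 0.13, 0.10, 0.10, 0.11, 0.09, 0.08]  # hypothetical
print([round(f, 3) for f in simple_exp_smoothing(rates)])
```

Because there is no trend or seasonal component, the method simply carries the last smoothed level forward, which is why its forecasts can drift above or below ARIMA’s when the series is still declining.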
So how did my forecasts compare to the real results? It turns out the actual return rates for July-September 2015 were around 4.7%, closest to the Holt-Winters forecast of 4.5%. In addition, the accuracy rate was almost 92%, compared to 88% using the traditional supply chain moving average method. After seeing the accuracy of the Holt-Winters model, the Google supply chain specialists were very happy with the outcome.
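For readers curious about the mechanics behind that Holt-Winters number, here is a from-scratch sketch of additive Holt-Winters in Python. The smoothing weights below are illustrative assumptions; R’s HoltWinters() chooses them by minimizing squared prediction error:

```python
def holt_winters_additive(y, m, alpha=0.3, beta=0.1, gamma=0.2, horizon=3):
    """Additive Holt-Winters with fixed (not optimized) smoothing weights.

    y: observed series, needing at least two full seasons of length m.
    Returns point forecasts for the next `horizon` periods.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons")
    # Initialize level, trend and seasonal indices from the first two seasons
    season1 = sum(y[:m]) / m
    season2 = sum(y[m:2 * m]) / m
    level = season1
    trend = (season2 - season1) / m
    seasonal = [y[i] - season1 for i in range(m)]

    for t, obs in enumerate(y):
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (obs - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (obs - level) + (1 - gamma) * s

    n = len(y)
    return [level + h * trend + seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]

# A toy series with a clean repeating pattern recovers that pattern:
print([round(f, 2) for f in holt_winters_additive([10, 12, 8, 10] * 2, m=4)])
```

With 24 months of data and m = 12, the model gets exactly the two full seasons it needs for initialization, which is one reason it could pick up the July-October seasonal bump that ARIMA(0,1,0) ignores.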
I am grateful to have learned so much from this experience. The takeaway is that just because an idea hasn’t been implemented in an established field doesn’t mean it can’t be implemented with success. The key is to think outside the box and be willing to take a tool widely used in one field and test it in another. In the next blog I will share real-world applications of K-means clustering and dendrograms, useful data analysis tools that I occasionally use at my current job with a biotech company.