Hope everyone is well and enjoying the warm weather as we proceed into summer. In this blog I want to share my practical experience on statistical process control (SPC) as well as its implementation using R and SAS. Statistical process control is a quality control method used by manufacturing industry or companies that utilizes Six sigma or deals a lot with manufacturing data. Some of the big companies that deploy statistical process control are: Qualcomm, Intel, Genentech, Honda Motors and General Motors. Although I have done a lot of statistical analysis before, I knew very little about this manufacturing-oriented statistical concept until I started working on my current job at a major biotech company, in which SPC analysis became a major component of my daily tasks. Read on and I think you will find it interesting.
The steps of statistical process control are: 1.During (preferred) or after manufacturing process, collect data that are representative of the quality of the product. 2. Plot the data on a process control chart with control limits determined by the spread of these data. 3. Spot any outlier above or below the upper control limit (UCL) line or lower control limit (lcl) line, or identify any upward/downward trends that might push the data over the control limits in future periods. 4. If any quality issue is found from the chart, notify process engineers or subject matter experts for remedial actions. In addition to the UCL, LCL and center line (Also known as CL. You can think of this as the overall average of all observations). Sometimes you will also see the upper specification limit (USL) and lower specification limit (LSL) lines on the chart. Normally these lines are wider than the control limit lines so you should see USL line above UCL line and LSL line below LCL line. Think of the specification lines as the “absolute limits” set by industry standards and, unless there’s a process change, any data that fall outside the spec limit lines should be called for immediate attention. The manufacturing process is called “in-control” if most of the data are near the center line and there’s no outlier or visible trends that would push the data beyond the control limits. For simplicity, this blog will focus on control limit lines.
Let’s get to the real story now! Here I have a dataset that contains 28 observations for a process parameter called Final Yield. The Excel spreadsheet containing these data are attached below:
Because there’s only 1 observation for each lot. The process control chart we will run is called the individual chart. Assuming the name of your data file is “sampledata.csv”, you can run the following codes to arrive at the individual chart.
>yield_I<-qcc(yield,type=”xbar.one”,plot=TRUE,title=”Individual Chart”,ylab=”Final Yield”,xlab=”Lot Number”)
There are couple elements you should be aware of in this chart. First of all, lot number 12 and 28 are much lower than the rest of the observations. Secondly, the lots near 28 have a downward sloping trend that you should watch out for because there’s a chance that the next few observations would touch or even cross the LCL line.
Another type of process control chart is the moving range chart. This chart takes the difference between each consecutive observation and plots their absolute value one by one. The purpose is to see how much the data has changed over time. The following R codes are used to generate the moving range chart. Note that the variable “yield” has already been defined as a vector that contains the Final Yield data.
>yield_movingrange<-qcc(yield.mr,type=”R”,plot=TRUE,title=”Moving Range Chart”,ylab=”Moving Average”,xlab=”Observation”)
As you can see, the biggest change comes between lot number 12 and 13, as lot 12 is much smaller than lot 13 in the individual chart. Also, the downward rate of change increases as we go from lot 23 to lot 28 in the individual chart, which again can be observed starting in observation #23 in the moving range chart.
Now we can run the Shapiro-Wilk test, which is a normality test widely used by manufacturing firms including pharmaceutical companies. The test applies the famous null hypothesis principle to check whether a sample dataset came from a normally distributed population. If the p-value obtained from this test is greater than alpha=0.05 (or whatever significance level you specified), then the underlying population is normally distributed. The R code and result are shown below:
Since the p-value is 0.5269>0.05, we are at least 95% confident that the Final Yield data is normally distributed. Again, although the data are normal, we should still be cautious because 2 of the observations are much lower than the center line and almost reaching the LCL line. There’s also a downward trend towards the last few observations that could push future lots below the LCL line. The next step will be to identify the USL and LSL. If the LSL is much lower than the LCL, then perhaps the downward trend is not an issue as long as it doesn’t persist.
In statistical process control and manufacturing industry in general, data are often divided into different groups. A parameter might contain more than 1 observation per lot. This happens quite often when a product undergoes the same quality test multiple times. If that’s the case, we’ll have to run the X-bar chart which plots the mean value for each subgroup. The X-bar chart looks very similar to individual chart except that each point on the X-bar chart is average value of multiple observations.
The attached Excel file below contains the raw data for 104 observations, and every 4 observations belong to 1 group.
Before we can generate the X-bar chart, first run the following R codes to group data into 26 subgroups:
Group data into 26 observations:
 26 4
Now if you call out the “grouping” variable, you will get the following results:
This output confirms that the groupings were done correctly. Now we can generate the X-bar chart by running the following code:
>Xbar<-qcc(grouping[1:26,],type=”xbar”,title=”X-bar Chart”,ylab=”Average Yield”)
The resulting chart is shown below:
You can see that the chart is very consistent with no outlier on or below the UCL/LCL lines. Just to be sure let’s run the Shapiro-Wilk normality test on this dataset. You first have to get the mean value of each subgroup. After you’ve obtained the mean values of the 26 subgroups you can put those values into a vector and then run the Shapiro test.
Because the p-value is 0.5217>0.05. We are at least 95% that the data came from a normally distributed population. This is not really surprising since most observations are close to the center line and even the 2 observations with highest values are far from the UCL line.
For readers who use primarily SAS, the same SPC procedure can be done in SAS by running the following codes, assuming the header row has been removed:
Because the SAS I am using is a free version (the full version costs at least $10,000 in annual licensing fee), it doesn’t have with the SAS/QC Software built-in. If you have SAS full version installed in your company or university, you should be able to get the X-bar and R charts similar to R by running the above codes. The output charts should look similar to the picture below, which is a sample I obtained from:
To wrap things up, in this blog I have explained the concept of statistical process control and how to run it using R and SAS. If you’ve also read my previous blogs on K-means clustering and time series forecast, and have not used R extensively before, you might already be amazed how useful and versatile R is in statistical analysis. Still, there are a lot more you can do with R than what I have written about it. Be sure to check my website periodically for updates on useful tips and popular applications of R. If you have any question or would like to share your thoughts, please feel free to leave your comments in this blog.