K-Means Clustering – Unsupervised Machine Learning in Biotech

As promised in my previous post, I will share my experience applying K-means clustering to some tasks at my workplace. K-means is a form of unsupervised machine learning, so let me first explain the difference between supervised and unsupervised machine learning. In supervised machine learning, you are given a set of inputs together with their known outputs, and you use these data to predict the output for new inputs. Regression and classification algorithms belong to supervised machine learning because you already have values for both the independent variables and the dependent variable, and you are trying to predict the dependent variable for observations where it is unknown. In unsupervised machine learning, by contrast, the known outputs (labels) are not provided. What you have is a set of observations, and you try to group them into clusters so that the observations within each cluster share similar characteristics. K-means tends to work best when the clusters are of roughly similar size.

In my current job with a Bay Area pharmaceutical company, my daily tasks involve performing statistical process control analysis and continued process verification on pharmaceutical manufacturing data, ranging from water content values to the average weight of coated tablets. While the data I pull from the database are always classified into groups already, what I learned about clustering algorithms in Professor Andrew Ng’s online Machine Learning course on Coursera made me curious to test machine learning methods on my own dataset. The results were promising. I was able to use K-means clustering to sort a given set of water content values and metal detection reject data into groups that closely match the grouping in the original dataset! Having said that, I won’t use the actual dataset for demonstration here due to confidentiality. I have revised my data slightly while preserving a structure similar to the original observations. The dataset, called “spec”, is attached below.

spec

There are 150 records in this dataset, each with 4 numeric attributes and a categorical “Types” variable. The 4 numeric attributes are Water Content Value at site A, Water Content Value at site B, Metal Detection Rejects at site A and Metal Detection Rejects at site B. The “Types” variable classifies each record as low, medium or high specification. The columns are abbreviated as follows, where WCV and MDR stand for Water Content Value and Metal Detection Rejects, respectively:

WCV.SiteA

WCV.SiteB

MDR.SiteA

MDR.SiteB

Types

Before we can start testing the K-means clustering algorithm, we have to make a copy of the original dataset without the Types column, since we want to know whether the algorithm can sort the data into 3 groups like those in the original dataset. The copy is called “specification”, and we set its Types column to NULL. The R code, as well as a small portion of the resulting dataset, is shown below:

data sans types
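For readers who cannot see the screenshot, here is a minimal sketch of the step just described. The values in the stand-in data frame are made up for illustration (my real data stay confidential), and only the column names follow the post.

```r
# Hypothetical stand-in for the real data: same column names, made-up values.
original <- data.frame(
  WCV.SiteA = c(2.1, 3.4, 4.8),
  WCV.SiteB = c(2.3, 3.1, 4.6),
  MDR.SiteA = c(1, 4, 9),
  MDR.SiteB = c(0, 5, 8),
  Types     = factor(c("low", "medium", "high"))
)

# Work on a copy so the labelled data stay available for later comparison.
specification <- original
specification$Types <- NULL   # drop the categorical column

head(specification)           # only the 4 numeric columns remain
```

Keeping “original” untouched matters because its Types column is what the clusters get compared against later on.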

Now that we have a new dataset where all variables are numeric, we can standardize it so that the variables are comparable. Standardizing centers each column at a mean of 0 and rescales it to a standard deviation of 1. The values themselves change, but the relative positions of the observations within each column do not. The resulting dataset will be saved as “scaledata” in R:

>scaledata<-scale(specification)
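To make the effect of scale() concrete, here is a small sketch on made-up numbers (an assumption for illustration, not my real measurements). It shows that each column ends up with mean 0 and standard deviation 1.

```r
# Made-up measurements on very different scales (illustration only).
toy <- data.frame(
  WCV.SiteA = c(2.0, 2.5, 3.0, 3.5),
  MDR.SiteA = c(100, 300, 200, 400)
)

scaled_toy <- scale(toy)    # subtract the column mean, divide by the column sd

colMeans(scaled_toy)        # each column now has mean 0 (up to rounding)
apply(scaled_toy, 2, sd)    # and standard deviation 1

# The constants used for centering and scaling are kept as attributes.
attr(scaled_toy, "scaled:center")
attr(scaled_toy, "scaled:scale")
```

Without this step, a column measured in hundreds would dominate the distance calculations that K-means relies on.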

Now we must determine the optimal number of clusters. For this I want to thank STHDA.com (Statistical Tools for High-throughput Data Analysis) for displaying some useful packages for this analysis. Let’s proceed:

We first have to install the aforementioned R packages and load them with library():

install packages

Next we run the following code to plot the “elbow” chart and find the elbow point (that is, where the absolute value of the slope of the line first starts to flatten):

code for elbow chart

elbow plot

As you can see from the elbow chart, the slope (in absolute value) of the line first starts to flatten when the number of clusters equals 3. That means 3 is a sensible number of clusters for this dataset.
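If you prefer not to install extra packages, the same idea can be sketched in base R. This is not the code from the screenshot above, just one way to do it, and it runs on synthetic data that I generate here as a stand-in: fit K-means for k = 1 to 10 and plot the total within-cluster sum of squares.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Synthetic stand-in: three well-separated groups of 50 points in 4 dimensions.
toy <- rbind(
  matrix(rnorm(200, mean = 0),  ncol = 4),
  matrix(rnorm(200, mean = 5),  ncol = 4),
  matrix(rnorm(200, mean = 10), ncol = 4)
)
scaled_toy <- scale(toy)

# Total within-cluster sum of squares for k = 1..10.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(scaled_toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where the curve stops dropping steeply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The nstart = 25 argument reruns K-means from 25 random starts for each k and keeps the best fit, which makes the curve less sensitive to an unlucky starting point.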

We can now start classifying observations into the 3 clusters and save the outcome into “results”:

>results<-kmeans(scaledata,3)
>results

>results$centers
results centers
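Since kmeans() picks its starting centers at random, the cluster numbers and sometimes the clusters themselves can differ from run to run. Here is a small sketch, on synthetic data of my own making, of how one might fix the seed and look at the main pieces of the fitted object.

```r
set.seed(123)  # fixed seed so the random starting centers are repeatable

# Synthetic stand-in for scaledata: three obvious groups of 20 points each.
toy <- scale(rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
))

# nstart = 25 tries 25 random starts and keeps the best solution.
fit <- kmeans(toy, centers = 3, nstart = 25)

fit$cluster       # cluster number assigned to each row
fit$centers       # one row of (scaled) coordinates per cluster center
fit$size          # number of observations in each cluster
fit$tot.withinss  # total within-cluster sum of squares
```

Note that the centers are reported on the standardized scale, because the algorithm was run on the scaled data.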

Next we can see how K-means clustering compares to real results:

>table(original$Types,results$cluster)

From this table we can see that “low” has 50 observations in the original dataset, and that they’ve all been grouped into cluster 3. (Notice the cluster 3 column doesn’t have any other number above or underneath 50. That means no data points from “high” or “medium” have been mistakenly grouped with “low”.) So cluster 3 is perfectly classified! For “high” and “medium”, each has 50 observations in the original dataset. However, clusters 1 and 2 are slightly off, as we have 53 in cluster 1 and 47 in cluster 2 (still very close to perfect). Keep in mind that the cluster numbers themselves are arbitrary labels, so a different run may number the clusters differently.
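Reading the table by eye works for 3 clusters, but the comparison can also be summarized as a single number. Below is a rough sketch on synthetic data (the labels and values are invented for illustration): each cluster is matched to its most common label, and then the share of observations that agree is computed.

```r
set.seed(7)

# Synthetic stand-in: known labels plus two correlated numeric measurements.
truth  <- factor(rep(c("high", "low", "medium"), each = 30))
values <- c(rnorm(30, mean = 10), rnorm(30, mean = 0), rnorm(30, mean = 5))
toy    <- scale(cbind(x = values, y = values + rnorm(90, sd = 0.5)))

fit <- kmeans(toy, centers = 3, nstart = 25)

# Rows are the known labels, columns are the cluster numbers.
confusion <- table(truth, fit$cluster)
confusion

# Cluster numbers are arbitrary, so match each cluster to its most common
# label before counting how many observations agree.
agreement <- sum(apply(confusion, 2, max)) / sum(confusion)
agreement
```

An agreement close to 1 means the clusters line up almost perfectly with the known groups.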

Now we can visually compare the groupings of water content values and metal detection rejects produced by K-means clustering with those in the original dataset:

>plot(original[c("WCV.SiteA","WCV.SiteB")],col=results$cluster)

>plot(original[c("WCV.SiteA","WCV.SiteB")],col=as.factor(original$Types))

>plot(original[c("MDR.SiteA","MDR.SiteB")],col=results$cluster)

>plot(original[c("MDR.SiteA","MDR.SiteB")],col=as.factor(original$Types))

As you can see from the charts above, the groupings done by the K-means algorithm are very close to the actual data, which is impressive given that R was able to produce these satisfying results without much instruction.

The exact same algorithm can be run in SAS as well, albeit with slightly more complicated code. Below is the SAS code to arrive at the elbow curve:

PROC IMPORT OUT= cluster DATAFILE= "/folders/myfolders/spec.csv"
DBMS=csv replace;
run;

proc standard data=cluster out=scalecluster mean=0 std=1;
var WCV_SiteA WCV_SiteB MDR_SiteA MDR_SiteB;
run;

%macro kmean(K);

proc fastclus data=scalecluster out=outdata&K. outstat=cluststat&K. maxclusters= &K.;
var WCV_SiteA WCV_SiteB MDR_SiteA MDR_SiteB;
run;

%mend;

%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
%kmean(10);

data clus1;
set cluststat1;
nclust=1;

if _type_='RSQ';