In this blog I would like to introduce a supervised machine learning tool called the K-Nearest Neighbors (KNN) algorithm. The algorithm "studies" a given set of training data and their categories, then attempts to correctly classify new instances into those categories. KNN can be applied to classification or regression; this blog covers only classification, using R.
The dataset we’ll use is called “classification_spec” and it consists of 144 observations and 5 variables. The first 4 variables (numeric) are GY.SiteA, GY.SiteB, CY.SiteA and CY.SiteB. GY and CY stand for granulation yield and compression yield, respectively, of a hypothetical pharmaceutical product. Site A and Site B are the manufacturing sites of this hypothetical product. The numbers in this dataset are randomly generated and do not represent any real drug data. The fifth variable is Type, which indicates whether the yield of the product is high, medium or low. You can download the dataset via the following link:
Let’s get started with the R code:
> knn <- read.csv("C:/Users/Shaolun/classification_spec.csv", header=TRUE)
The code above reads the CSV file into R and assigns the data to a variable called “knn”.
Normally you would scale the data at this point so that the spreads of the variables are more consistent. Since the ranges of the 4 numeric variables here are very similar to each other, there’s no need to scale the data in this case.
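If scaling were needed, a minimal sketch might look like this (the variable name `knn_scaled` is mine; this assumes the “knn” data frame has been loaded as above):

```r
# Hypothetical sketch: standardize the four numeric columns.
# scale() centers each column to mean 0 and rescales it to standard deviation 1.
knn_scaled <- knn
knn_scaled[, 1:4] <- scale(knn[, 1:4])
summary(knn_scaled[, 1:4])  # each column should now be centered near 0
```

This matters for KNN because the algorithm relies on distances: a variable with a much larger range would dominate the distance calculation.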
> str(knn)

This code outputs the data structure of all 5 variables.
This code displays all column names of the dataset:

> names(knn)
[1] "GY.SiteA" "GY.SiteB" "CY.SiteA" "CY.SiteB" "Type"
> summary(knn)

The above code displays a statistical summary of all variables.
The next thing to do is to install the "class" package, which provides the KNN algorithm in R.

> install.packages("class")
> library(class)

Now we have to split the data into a training set and a testing set. The common suggestion is to put 2/3 of the data into the training set and 1/3 into the testing set.
> set.seed(1234)

Setting the seed allows you to get the same random numbers whenever you call the same seed function. By doing this you can keep track of the same random numbers easily.
> knn_split <- sample(2, nrow(knn), replace=TRUE, prob=c(0.67, 0.33))
The “sample” function takes a sample with the same size as the number of rows in the “knn” data set, which is 144. The “2” inside the function means each of the 144 rows is assigned either a “1” or a “2”: a 1 with probability 0.67 and a 2 with probability 0.33. The “replace=TRUE” argument means sampling is done with replacement, so either value can be drawn on every assignment.
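To see the same pattern in isolation, here is a toy illustration on 10 rows instead of 144 (the name `toy_split` is mine):

```r
# Toy illustration of the split logic: assign each of 10 rows a group label,
# 1 with probability 0.67 and 2 with probability 0.33, with replacement.
set.seed(1234)
toy_split <- sample(2, 10, replace = TRUE, prob = c(0.67, 0.33))
toy_split          # a vector of ten 1s and 2s
table(toy_split)   # counts of rows assigned to each group
```

Because the draws are random, the realized split will only approximate the 0.67/0.33 ratio, especially on small samples.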
> knn.training <- knn[knn_split==1, 1:4]
> knn.testing <- knn[knn_split==2, 1:4]
These 2 lines define your training set and testing set, allocating the rows marked “1” (training) and “2” (testing) and keeping only the first 4 (numeric) variables. The fifth variable (Type) is left out on purpose because you want to test the accuracy of the KNN algorithm’s predictions against it later.
> knn.trainLabels <- knn[knn_split==1, 5]
> knn.testingLabels <- knn[knn_split==2, 5]
These 2 lines store the fifth variable (Type) separately; without these labels, you won’t be able to tell whether the algorithm’s predictions are correct.
> summary(knn.trainLabels)
  high    low medium
    37     33     36
> summary(knn.testingLabels)
  high    low medium
    11     15     12
Running the “summary” function on the 2 label vectors allows you to verify that 106 observations were assigned to the training set and 38 to the testing set, which is close to the 2/3 and 1/3 ratio.
Now we are ready to run the KNN model! In the “knn” function, simply pass in the names of the training set and testing set, and supply the training labels to the “cl” argument (cl is the factor of true classifications for the training set, equivalent to the “Type” variable). “k” is the number of nearest training samples used to classify each testing sample, so the larger the “k”, the more training samples vote on the predicted class of a testing sample. A general rule of thumb for “k” is k = sqrt(n)/2, where n is the number of observations in your training set (106 in our case). Since “k” works out to about 5.15, I will round it to 5.
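The rule-of-thumb calculation can be done directly in R (the name `n_train` is mine):

```r
# k = sqrt(n)/2 with n = 106 training observations
n_train <- 106
k <- sqrt(n_train) / 2   # about 5.15
round(k)                 # rounds to 5
```

An odd k is often preferred for classification because it reduces the chance of tie votes, which is another reason 5 is a convenient choice here.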
> knn_pred <- knn(train = knn.training, test = knn.testing, cl = knn.trainLabels, k=5)
Running “knn_pred” gives you the predicted classification output of KNN algorithm for the testing set.
To see how accurate the predicted output is versus the actual classes in the testing set, you can run the following lines of code to put the KNN predictions and the testing-set classes into a data frame and then compare them side by side.
> Predicted_Type <- knn_pred
> Actual_Type <- knn.testingLabels
> data.frame(Predicted_Type, Actual_Type)
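As an alternative to eyeballing the data frame row by row, a cross-tabulation summarizes the agreement in a single table (a sketch assuming `knn_pred` and `knn.testingLabels` exist as defined above):

```r
# Confusion matrix: rows are predicted classes, columns are actual classes.
# Off-diagonal counts are misclassifications; an all-diagonal table means
# every testing observation was classified correctly.
table(Predicted = knn_pred, Actual = knn.testingLabels)
```

Dividing the sum of the diagonal by the total count gives the overall accuracy rate.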
As you can see, the accuracy rate in this case is 100%! Based on this observation, KNN seems to be a very impressive algorithm to use. One thing we need to be careful about, though, is that our dataset only contains 144 observations (and even fewer once it is split into training and testing sets). To really see the power of KNN, and test its overall accuracy, you should try running it on datasets that contain thousands or even millions of records. This blog serves as a lighthouse for those who are not familiar with KNN but are interested in getting started on this algorithm using small and simple datasets. As usual, feel free to leave comments or questions below 🙂