K-Nearest Neighbors Algorithm (Supervised Machine Learning) using R

In this blog post I would like to introduce a supervised machine learning method called the K-Nearest Neighbors (KNN) algorithm. The algorithm “studies” a set of training data together with its known categories, then classifies each new instance by looking at the categories of the training observations closest to it. KNN can be applied to classification or regression. This post covers classification only, using R.

The dataset we’ll use is called “classification_spec” and it consists of 144 observations and 5 variables. The first 4 variables (numeric) are GY.SiteA, GY.SiteB, CY.SiteA and CY.SiteB. GY and CY stand for granulation yield and compression yield, respectively, of a hypothetical pharmaceutical product. Site A and Site B are the manufacturing sites of this hypothetical product. The numbers in this dataset are randomly generated and do not represent any real drug data. The fifth variable is Type, which indicates whether the yield of the product is high, medium or low. You can download the dataset via the following link:


Let’s get started with the R code:

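> knn <- read.csv("C:/Users/Shaolun/classification_spec.csv", header=TRUE, stringsAsFactors=TRUE)

The code above reads the csv file into R and assigns the data to a variable called “knn”. The stringsAsFactors=TRUE argument stores the Type column as a factor. R 4.0 and later read text columns as plain character strings by default, and the label summaries further down rely on Type being a factor.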

> scale(knn[1:4])

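This code standardizes each of the 4 numeric columns to a mean of 0 and a standard deviation of 1, so that no variable dominates the distance calculation simply because it is measured on a larger scale. As written, the result is only printed, not saved. Since the ranges of the 4 numeric variables are very similar to each other, there’s no need to scale the data in this case, and the rest of this post uses the unscaled values.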
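To make the effect of scale() concrete, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only, not part of the classification_spec data.

```r
# Toy data with two columns on very different scales (hypothetical values)
toy <- data.frame(
  GY.SiteA = c(90, 92, 95, 97),
  CY.SiteA = c(0.80, 0.85, 0.90, 0.95)
)

# scale() returns a new matrix; the original data frame is left unchanged
toy_scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))

# To use the scaled values later, store them, for example as a data frame
toy_scaled_df <- as.data.frame(toy_scaled)
```

If your own variables are on very different scales, you would pass the stored, scaled columns to the later steps instead of the raw ones.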

> str(knn)
This code outputs the data structure of all 5 variables.

> names(knn)

This code displays all column names of the dataset:

[1] "GY.SiteA" "GY.SiteB" "CY.SiteA" "CY.SiteB" "Type"

> summary(knn)

The above code displays a statistical summary of all variables.


The next step is to install and load the "class" package, which 
provides the knn function used below.

> install.packages("class")
> library(class)

Now we have to split the data into a training set and a testing set. 
A common suggestion is to put 2/3 of the data into the training set 
and 1/3 into the testing set.

> set.seed(1234)

Setting the seed makes R's random number generator reproducible: 
every time you set the same seed and rerun the same code, you get 
the same random numbers, and therefore the same split.
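A quick way to convince yourself of this is the small sketch below. The variable names are made up for illustration.

```r
# Draw 5 numbers after setting the seed
set.seed(1234)
first_draw <- sample(1:10, 5)

# Reset to the same seed and draw again
set.seed(1234)
second_draw <- sample(1:10, 5)

# Same seed, same code, same random numbers
print(identical(first_draw, second_draw))

# Without resetting the seed, the generator simply continues its stream
third_draw <- sample(1:10, 5)
```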

> knn_split <- sample(2, nrow(knn), replace=TRUE, 
prob=c(0.67, 0.33))

The “sample” function draws as many values as there are rows 
in the “knn” data set, which is 144. The “2” inside the 
function means each of the 144 rows is assigned either “1” 
or “2”. A row is assigned 1 with a probability of 0.67 and 
2 with a probability of 0.33. The “replace=TRUE” argument 
means sampling with replacement, so 1 and 2 can each be drawn 
again and again. Because every row is assigned independently, 
the final split is close to 2/3 and 1/3 but not exact.
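The sketch below repeats the same call on a bare row count of 144, so you can inspect the split without the csv file. The names n_rows, split_id and no_replace are assumptions for illustration.

```r
set.seed(1234)
n_rows <- 144  # same number of rows as the data set described above

# Assign each row to group 1 (training) or group 2 (testing)
split_id <- sample(2, n_rows, replace = TRUE, prob = c(0.67, 0.33))

# Group sizes and their shares of the total
print(table(split_id))
print(prop.table(table(split_id)))

# Without replace = TRUE there are only 2 values to draw from,
# so asking for 144 draws raises an error; capture its message here
no_replace <- tryCatch(
  sample(2, n_rows),
  error = function(e) conditionMessage(e)
)
print(no_replace)
```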
> knn.training <- knn[knn_split==1, 1:4]
> knn.testing <- knn[knn_split==2, 1:4]

These 2 lines define your training set and testing set. The 
rows assigned 1 go into the training set and the rows assigned 
2 go into the testing set, and in both cases only the first 4 
(numeric) variables are kept. The fifth variable (Type) is 
left out on purpose because it is what the KNN algorithm has 
to predict.
> knn.trainLabels <- knn[knn_split==1, 5]
> knn.testingLabels <- knn[knn_split==2, 5]

These 2 lines store the fifth variable (Type) separately for 
the training rows and the testing rows. The training labels 
are what the algorithm learns from, and without the testing 
labels you won’t be able to tell if its predictions are correct.
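Here is a minimal sketch of the same feature/label split on a made-up 6-row data frame. The data, the fixed toy_split vector and all variable names are assumptions for illustration. The point is that features and labels are cut with the same row filter, so they stay aligned.

```r
# Hypothetical data: 2 numeric features and a class label
toy <- data.frame(
  GY.SiteA = c(91, 85, 78, 93, 84, 77),
  GY.SiteB = c(90, 86, 79, 92, 83, 76),
  Type = factor(c("high", "medium", "low", "high", "medium", "low"))
)

# Fixed group assignment for illustration (1 = training, 2 = testing)
toy_split <- c(1, 1, 2, 1, 2, 1)

# Features: numeric columns only
train_x <- toy[toy_split == 1, 1:2]
test_x  <- toy[toy_split == 2, 1:2]

# Labels: the Type column, filtered with the same row conditions
train_y <- toy[toy_split == 1, 3]
test_y  <- toy[toy_split == 2, 3]

# Each feature row must have exactly one matching label
stopifnot(nrow(train_x) == length(train_y))
stopifnot(nrow(test_x) == length(test_y))
```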
> summary(knn.trainLabels)
high    low    medium     
37      33     36 

> summary(knn.testingLabels)
high    low    medium     
11      15     12 

Running the “summary” function on the 2 sets of labels allows 
you to verify that 106 observations are assigned to the training 
set and 38 to the testing set, which is roughly the 2/3 
and 1/3 ratio we aimed for.
Now we are ready to test the KNN model! In the “knn” function, 
simply pass the training set, the testing set, and the training 
labels as the “cl” argument (cl is the factor of true 
classifications of the training set, taken here from the “Type” 
variable). “k” is the number of nearest training observations 
that vote on the class of each testing observation. So the larger 
the “k”, the more training observations are used to predict the 
class of each testing observation. One rule of thumb for “k” is 
k=sqrt(n)/2, where n is the number of observations in your 
training set (106 in our case). Since this gives about 5.15, 
I will round it to 5.
> knn_pred <- knn(train = knn.training, test = knn.testing, 
cl = knn.trainLabels, k=5)
> knn_pred

Running “knn_pred” prints the class that the KNN algorithm 
predicted for each observation in the testing set. 
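To see what happens behind a single prediction, here is a minimal sketch that classifies one new point by hand and then compares the result with the knn function. The toy data and every variable name are assumptions for illustration. It uses plain Euclidean distance and a simple majority vote, which is the textbook description of KNN rather than a copy of the package's internals.

```r
library(class)  # provides knn()

# Hypothetical training data: two well-separated groups
train_x <- data.frame(
  x1 = c(1.0, 1.2, 0.9, 5.0, 5.2, 4.8),
  x2 = c(1.1, 0.8, 1.0, 5.1, 4.9, 5.3)
)
train_y <- factor(c("low", "low", "low", "high", "high", "high"))

# One new observation to classify, using k = 3 neighbors
new_point <- data.frame(x1 = 1.5, x2 = 1.4)
k <- 3

# Euclidean distance from the new point to every training row
dists <- sqrt((train_x$x1 - new_point$x1)^2 +
              (train_x$x2 - new_point$x2)^2)

# Take the k closest training rows and let their labels vote
nearest <- order(dists)[1:k]
votes <- table(train_y[nearest])
manual_pred <- names(votes)[which.max(votes)]
print(manual_pred)

# The knn function should agree with the manual vote
knn_pred_toy <- knn(train = train_x, test = new_point, cl = train_y, k = k)
print(knn_pred_toy)
```

Here the three closest training rows all belong to the “low” group, so both approaches predict “low”.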
To see how accurate the predicted output is versus the 
actual classes in the testing set, you can run the following 
lines of code to put the KNN predictions and the testing set 
classes into a data frame and compare them side by side.

> Predicted_Type <- knn_pred
> Actual_Type <- knn.testingLabels
> data.frame(Predicted_Type, Actual_Type)
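Reading a data frame row by row gets tedious for larger testing sets. A more compact alternative is a confusion matrix plus a single accuracy number, sketched below. The two label vectors are made up for illustration and deliberately contain one mistake; they are not the output of the model above.

```r
# Hypothetical predicted and actual labels (one deliberate mismatch)
type_levels <- c("high", "low", "medium")
predicted <- factor(c("high", "low", "medium", "medium", "low", "high"),
                    levels = type_levels)
actual    <- factor(c("high", "low", "medium", "low", "low", "high"),
                    levels = type_levels)

# Confusion matrix: rows are predictions, columns are true classes;
# correct predictions sit on the diagonal
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Accuracy: the share of predictions that match the actual class
accuracy <- mean(predicted == actual)
print(accuracy)
```

With the objects from this post, the same two calls would be table(Predicted_Type, Actual_Type) and mean(Predicted_Type == Actual_Type).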

In this run, every predicted type matches the actual type, so the 
accuracy rate is 100%! Based on this result, KNN seems to be a very 
impressive algorithm. One thing we need to be careful about, though, 
is that our dataset contains only 144 observations (and the testing 
set only 38 of them), so a perfect score here says little about how 
KNN performs in general. To really see the power of KNN and test its 
overall accuracy, you should try it on datasets that contain thousands 
or even millions of records. This post is meant as a starting point 
for those who are not familiar with KNN but are interested in getting 
started with the algorithm on small and simple datasets. As usual, 
feel free to leave comments or questions below 🙂




