After learning Support Vector Machines we are going to advance into the territory of kNN a.k.a KNearestNeighbors. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). kNN has been used in statistical estimation and pattern recognition. kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The kNN algorithm is among the simplest of all machine learning algorithms. So the way a kNN works is, once all the data points have been included in the model, it searches for the datapoints that are the closest to it i.e The Neighbors hence creating the name k-Nearest Neighbors. Here k stands for a variable which denotes how many number of neighbors i.e the closest datapoints to it.
Hey Guys this is Manas from csopensource.com – “Your one stop destination for everything computer science”. Today I am back with another freshly brewed ML Tutorial on K-Nearest Neighbors. Sit back and let your brain absorb the knowledge!
Classification & Learning process
kNN is a classification algorithm. This algorithm works by finding the euclidean distance between points of the nearest classes and the new point. The number of points that it needs to find the distance between is denoted by K. if K = 3 then then the model has to find 3 closest points and the majority of the closest distance is taken and the class is identified. For example we have the value of K as 3, two classes ‘A’ & ‘B’ along with a new point named ‘C’. How do we predict which class does ‘C’ belong to? The answer here is using a kNN Classifier. Let us see a diagrammatic representation
This diagram represents the way a kNN classifies a new point on the basis of the euclidean distance to other points.
Some more information
Here we can see that because our K value is 3, we are checking the distance to other points only 3 times. We can clearly make out that the new yellow point ‘C’ is closer to the class ‘A’ than the class ‘B’ simply because the total euclidean distance from the points in class ‘A’ is lesser than the ones in class ‘B’. Thus the kNN classifies this new point to be of type ‘A’. The more the value of ‘K’ more the accuracy of the classification. However there occurs a point of diminishing returns where a high value of K doesn’t give the best results but instead taxes the computer. We Should always choose a generous amount for K. The accuracy of this model is counted by dividing the number of closest points and the total K value.
In this case the accuracy is 2(closest points) / 3(value of K) so 2/3 = 0.66% so we have an accuracy of 66%. Generally we would not want to have an even K Value. Although a accuracy of 66% is not that great we have a good enough classification here. So for the purpose of this example it is fine. However to increase the accuracy, we can increase the value of K. This is a simple and intuitive understanding of how a kNN Classifies a new point.
Things to keep in mind
- The dataset that we are going to use needs to have as little noise as possible, here ‘noise’ refers to the overlapping of the target classes.
- We need to have labelled features in our dataset as a kNN is a supervised type of ML.
- Our dataset needs to contain only relevant features. If there are some features that do not contribute anything to the final answer like “patient name” etc. we have to drop it as it increases the complexity and decreases the overall accuracy.
- Avoid using large K values for a kNN working on a large dataset as it takes a lot of time to train because it needs to find the distance to other points n number of times, where n stands for the value of K.
- Some popular use cases of a kNN include:
- Recommendation Systems
- Credit Risk Analysis
- Predictive Trip Planning
- Stock price recommendation
- a kNN is a supervised Classifier
IMPORTANT!!! – WE HAVE TOLD THAT A kNN FINDS THE DISTANCE BETWEEN ITS CLOSEST NEIGHBORS FOR THE VALUE OF N. BUT WHAT DISTANCE? THE ANSWER IS THAT A kNN FINDS THE EUCLIDEAN DISTANCE BETWEEN THE CLOSEST POINTS.
Coding A Problem using a kNN Algorithm in Python
Now that you have a rough idea about what is a kNN algorithm, let us solve a simple problem in Python. We are going to work on a dataset which contains all the records of Heart Disease Patients. Our task is to classify which category does the new patient belong to! Firstly we need to download our dataset, to do that go to this link. This is a link to my Dropbox where I have added a file named ‘Heart.data’ click on download and save it to your working directory.
open cmd and enter pip install sklearn if sklearn hasn’t been installed already.
So Heart Disease Classification involves many features to be included in the dataset to give a valid prediction. Let us see a list of all the features that we need to use.
So this is a list of all the features required to predict a likelihood of a patient developing heart disease.
This program looks a little intimidating, let me explain it for you!
- In Line 1-6 we are importing all the necessary modules
- Line 7 We are importing the dataset using read_csv function of the pandas module
- Form Line 9-11 We are removing certain columns from our dataset as they don’t help much in the training and testing process. Also, they were filled with a lot of missing data and replacing all of them with outliers would hurt the efficiency of the model.
- In Line 13 We are replacing any left over missing datapoints with an outlier value i.e -99999, this ensures that the data is ignored by the model.
- In Line 15 we are creating the X values or the features, i.e Every thing in the dataset except the class values (these values are labels).
- Line 16 We are creating the Y values or the labels, these values help the model to perform supervised training. We create this by extracting only the class values from the dataset
- Line 17 We define our Classifier object and pass a K Value of 7, i.e I want to find 7 neighbors. You Can use nay number you want but I chose 7.
- In Line 18 The dataset is split into training and testing sets
- In Line 20 We train the kNN.
- On Line 21 We try to predict the case of a new patient by entering a few sample values into a numpy array, these values correspond to the features that I had mentioned in the feature description.
- On Line 22 We check the accuracy of the model using the test set
- From Line 24-31 We work on parts of code that show the output generated by our model.
And That’s it! We have successfully created a K-Nearest Neighbors Classifier using python with sklearn.
DISCLAIMER!!! – THIS SCRIPT THAT WE HAVE MADE CAN PREDICT IF YOU HAVE A 50% CHANCE OF DIAMETER NARROWING OF A MAJOR BLOOD VESSEL (WHICH MAY BE A CONTRIBUTOR TO HEART DISEASE). NOTICE THE 50%, THIS LIMITATION IS IMPOSED BY OUR DATASET. IF WE HAD GOT OUR HANDS ON A BETTER DATASET WE COULD HAVE ACHIEVED A ROCK SOLID ANSWER! BUT FOR A BEGINNER THIS IS GREAT!
Here we can see that based on our test input the patient has a greater than 50% chance of diameter narrowing in a major blood vessel and hence has a slight risk of developing Heart Disease. The model backs this statement with an accuracy of 63%. It also consisted of a lot of missing attributes and bad data. We have replaced all the missing attributes with an outlier(-99999) and thus most of our data-points are treated as outliers affecting our overall accuracy. The accuracy is not that high but manageable for a beginner 🙂
THIS ARTICLE IS A SHORT INTRODUCTION TO THIS ALGORITHM AND CAN BE CONSIDERED AS K-NEAREST NEIGHBORS IN A NUTSHELL. SOME THINGS HAVE BEEN SIMPLIFIED IN ORDER TO BE UNDERSTOOD WELL.
To sum it all up, The KNearestNeighbors algorithm is useful in Classification of smaller chunks of data. It is a method that finds the euclidean distance between n points, where n is the value of K that the user specifies. This approach is useful in many cases. a kNN is a relatively easy type of algorithm in the ML Pipeline. We also created and used a kNN to classify Heart Disease Patients.
Hope you learnt something new today and enjoyed today’s session. In the next blog post we will be discussing about a very simple topic in the ML Pipeline, Decision Trees.
I hope you understood what i tried to convey. Thank you for spending your valuable time, Until then Have a nice day and enjoy Machine Learning! 🙂