KNN is a supervised algorithm that assigns a label to a query point based on the labels of its k nearest training examples. The intuition is that a point most likely belongs to the class of the training samples most similar to it. In this project, we implemented the KNN algorithm from scratch; the main goal was to show how KNN classifies data.
- Dataset
The dataset contains 400 points, each described by x and y coordinates and a class label; in the plots, each class is drawn in a different color.
Our data is shown in the diagram below:
- Load & split train and test data
With Pandas, we read the CSV file and load the dataset as a DataFrame. Training and evaluating a KNN model requires separate train and test sets: the model predicts labels for the test set, and comparing those predictions against the true labels lets us evaluate our code. For the split we used Scikit-learn's train_test_split function.
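A minimal sketch of this step follows; since the project's CSV is not included here, the code first writes a synthetic stand-in file, and the file name `data.csv` and column names `x`, `y`, `label` are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's CSV: 400 labeled 2-D points.
rng = np.random.default_rng(0)
pts = rng.normal(size=(400, 2)) + np.repeat([[0, 0], [4, 4]], 200, axis=0)
pd.DataFrame({"x": pts[:, 0], "y": pts[:, 1],
              "label": np.repeat([0, 1], 200)}).to_csv("data.csv", index=False)

# Load the dataset as a DataFrame, as in the project.
df = pd.read_csv("data.csv")
X = df[["x", "y"]].to_numpy()
y = df["label"].to_numpy()

# Hold out 20% of the points for evaluating the classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```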
- distance_calculator()
In machine learning, each sample is represented as a vector, and KNN searches for the training vectors closest to a query point. The most straightforward way to measure that closeness is the Euclidean distance.

- K_nearest_neighbour_classifier()
Given the distances to every training point, the k nearest neighbors are simply the k training points with the smallest distances. This function finds the most frequent label among those neighbors and assigns it to the test point.

- accuracy_calculator()
Because KNN is supervised, we know the true test labels: accuracy is the number of correctly classified test points divided by the total number of test points.

- data_plot()
The plotter shows the train and test data with their class labels in a 2D coordinate diagram. It accepts either the true (supervised) labels or the predicted labels as input.
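The helper functions above might look like the following sketch; the names follow the text, but the exact implementation details are assumptions:

```python
import numpy as np
from collections import Counter

def distance_calculator(a, b):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def K_nearest_neighbour_classifier(X_train, y_train, point, k):
    """Label a point by majority vote among its k nearest training points."""
    dists = [distance_calculator(x, point) for x in X_train]
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)     # label frequencies
    return votes.most_common(1)[0][0]

def accuracy_calculator(y_true, y_pred):
    """Fraction of correctly classified points."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Note that in this sketch voting ties are broken by `Counter`'s insertion order; a more careful implementation might break ties by total distance.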
KNN's hyperparameter k is commonly chosen with an elbow chart: for each k from 3 to 20, you compute the loss (here, the error rate on the test set) and then pick the value that works best for your scenario.
For our dataset, K=13 has the least loss.
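The elbow search over k can be sketched as follows; synthetic data stands in for the project's dataset, so the best k found here need not match the K=13 reported above:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's 400-point, 2-class dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2)) + np.repeat([[0, 0], [4, 4]], 200, axis=0)
y = np.repeat([0, 1], 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def knn_predict(point, k):
    """Majority vote among the k nearest training points."""
    dists = np.sqrt(((X_train - point) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Error rate (loss) for each candidate k, as plotted in an elbow chart.
losses = {}
for k in range(3, 21):
    preds = np.array([knn_predict(p, k) for p in X_test])
    losses[k] = np.mean(preds != y_test)

best_k = min(losses, key=losses.get)  # the k with the least loss
```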
To train on the dataset, we have a function that combines the previously mentioned functions. It takes k as input to set the number of neighbors, and by the end of training it has predicted a label for every test data point. To measure how many predictions are correct, we call the accuracy calculator: our KNN predicts over 88 percent of the labels correctly.
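As a cross-check (an addition, not part of the project code), scikit-learn's built-in KNeighborsClassifier can be run with k=13 on a similar split; the synthetic dataset below is an assumption standing in for the project's CSV:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic 2-class stand-in for the project's 400-point dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2)) + np.repeat([[0, 0], [4, 4]], 200, axis=0)
y = np.repeat([0, 1], 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit KNN with the k chosen from the elbow chart and score it.
clf = KNeighborsClassifier(n_neighbors=13).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```

Comparing a hand-written KNN against the library version on the same split is a quick way to catch bugs in the manual implementation.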