K-Nearest Neighbors

Ali Mahzoon
5 min read · Jul 1, 2021

K-Nearest Neighbors (aka KNN) is built on distance. In other words, we measure how similar a data point is to all the other points in a data set by computing the distance between them, which tells us which points lie closest to that specific data point.

There is a wide variety of ways to calculate the distance between data points; we are going to study a handful of distance metrics that are useful for the KNN algorithm.

Manhattan Distance:

The Manhattan distance is the sum of the absolute differences of the Cartesian coordinates (just a sum of numbers, so it is not computationally heavy). You can think of Manhattan distance as moving through a grid of city blocks. The equation for calculating Manhattan distance is:
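For two points $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$:

$$d(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert$$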

You can also see Manhattan distance visually in the following picture:

The red path best describes the Manhattan distance from point a to point b

Euclidean Distance:

Euclidean distance is the shortest (straight-line) distance between two points and is the most commonly used distance metric. The general formula for the n-dimensional case is:
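$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$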

It might look familiar because Euclidean distance is essentially the Pythagorean theorem generalized to n dimensions (the square root makes it more computationally heavy for large datasets).

Minkowski Distance:

Minkowski distance is a generalized metric for calculating distance in an n-dimensional normed vector space. It takes the form:
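$$d(a, b) = \left( \sum_{i=1}^{n} \lvert a_i - b_i \rvert^{p} \right)^{1/p}$$

where $p \ge 1$ is the order of the norm.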

As you can see, the Manhattan distance and the Euclidean distance are special cases of the Minkowski distance: setting p = 1 gives the Manhattan distance (the L1 norm) and setting p = 2 gives the Euclidean distance (the L2 norm).
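As a quick sanity check, here is a minimal Python sketch (using NumPy, which is my own choice and not mentioned in the post) showing that the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean distances:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between points a and b."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

manhattan = np.sum(np.abs(a - b))          # L1 norm of the difference
euclidean = np.sqrt(np.sum((a - b) ** 2))  # L2 norm of the difference

print(minkowski(a, b, p=1), manhattan)  # both 5.5
print(minkowski(a, b, p=2), euclidean)  # both ~3.64
```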

At this point, we have covered the most common distance metrics for KNN. There are many other ways to compute distances between data points, including cosine similarity, Jaccard, and more, but those metrics are more useful for other algorithms. For instance, Jaccard is useful for recommendation systems, while cosine similarity is more useful for text processing.

The reason I’m talking about different distance metrics is that in lower dimensions they all give you similar results, but once you have more dimensions, different distance metrics might give you different orderings of which points are closest to the point you are looking at.
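To make this concrete, here is a small sketch of my own (the random data is purely illustrative, not from the post) that ranks ten candidate neighbors of a query point by Manhattan and by Euclidean distance in 50 dimensions; the two nearest-first orderings often disagree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))   # 10 candidate neighbors in 50 dimensions
q = rng.normal(size=50)         # query point

l1 = np.abs(X - q).sum(axis=1)            # Manhattan distances to the query
l2 = np.sqrt(((X - q) ** 2).sum(axis=1))  # Euclidean distances to the query

print("Nearest-first order under L1:", np.argsort(l1))
print("Nearest-first order under L2:", np.argsort(l2))
# In higher dimensions these orderings frequently differ, so the choice
# of metric can change which neighbors count as "closest".
```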

The Algorithm, How It Works and Its Characteristics

This algorithm has certain characteristics:

- Nonparametric: It doesn’t make any assumptions about the underlying data distribution.

- Lazy: This algorithm is lazy because it doesn’t learn a discriminative function from the training data but “memorizes” the training dataset instead. In other words, KNN places all the training data points in space, remembers where they are, and computes distances to them over and over again at prediction time.

- Used for both regression and classification

Note: Not learning a discriminative function means not learning patterns within your training data. For example, logistic regression learns the relationship between the log odds of the target and the features, and linear regression learns the relationship between the features and the target; KNN learns no such function.

The algorithm assigns class membership based on the K closest neighbors of the test observation in question. KNN is a similarity-based algorithm: the class of an observation depends on how similar its features are to those of the training observations.

Consider the image below:

We are going to classify the green circle based on the points closest to it. In KNN, you first decide how many of its neighbors you want to consider before making a classification.

Based on the picture, let’s say I just want to consider 1 nearest neighbor. The nearest neighbor in this 2D example is the blue square, which is class 1, and because it is the closest data point, you can reasonably assume that the green circle is very similar to the blue square. Therefore, it is class 1.

If you use a different number of neighbors, your result can change. For instance, if we change K from 1 to 3, the three closest data points are one blue square and two red triangles, so we classify the green circle as class 2. However, you can also weight this decision: a point that is closer gets a higher weight and a point that is further away gets a lower weight, which makes sense since you want your closest neighbors to count a little more than the neighbors that are a little further out.
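To show both voting schemes side by side, here is a minimal from-scratch KNN classifier in Python; the function name, toy data, and class labels are my own illustration, not from the original example:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, weighted=False):
    """Classify x_new from its k nearest training points (Euclidean distance)."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest training points
    if not weighted:
        # plain majority vote among the k neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # distance-weighted vote: closer neighbors count more
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)

# toy 2D data: class 1 plays the role of the squares, class 2 the triangles
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0], [3.1, 2.9], [0.9, 1.1]])
y_train = np.array([2, 2, 1, 1, 2])
print(knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=1))               # -> 1
print(knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=3, weighted=True)) # -> 1 (the two nearby class-1 points dominate)
```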

As mentioned above, the similarity depends on the distance between observations, which can be calculated using any of the distance metrics mentioned above. The most commonly used metric is the Euclidean distance. However, both the distance metric you use and the value of K can affect the performance of the model, and the decision boundary changes with different values of K.

How does the value of K affect the bias-variance tradeoff?

Note: The bias-variance tradeoff is a tradeoff between overfitting and underfitting. Typically, the higher your K is, the less you overfit your data. The following diagram shows how different values of K affect your decision boundary:

We can see that when K = 1 the decision boundary is not smooth, and we usually see overfitting in this case. As K gets bigger and you consider more data points around each observation, you address that overfitting.
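One practical way to compare values of K is cross-validation. Here is a sketch using scikit-learn (my own example, run on the Iris dataset rather than the data behind the diagram):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2)  # p=2 -> Euclidean
    scores = cross_val_score(model, X, y, cv=5)
    print(f"K={k:>2}  mean CV accuracy={scores.mean():.3f}")
# Very small K tends to overfit (jagged decision boundary);
# very large K tends to underfit (overly smooth boundary).
```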

There is a nice illustration by Chris Albon on this topic:
