KNN vs. K-Means
When you look at
the names of the KNN and k-means algorithms, you might ask whether k-means is
related to the k-nearest neighbors algorithm. One could easily make the mistake of
saying they're related: after all, they both have a "k" in their
names, and they're both machine learning algorithms that find
ways to label things, even if not the same kinds of things. But
the two k's refer to completely different things. The "k" in
k-means has absolutely nothing to do with the "k" in KNN.
K-Nearest Neighbors:
The k-nearest-neighbors
algorithm is a classification algorithm, and it is supervised:
it takes a set of labeled points and uses them to learn how to
label other points. To label a new point, it looks at the labeled points
closest to that new point (those are its nearest neighbors) and has those
neighbors vote: whichever label most of the neighbors have becomes the label
for the new point. The "k" is the number of neighbors it checks. It is
supervised because you are classifying a point based on the known
classification of other points. For example:
If I have a dataset of
soccer players with their positions and measurements, and I want to assign
positions to soccer players in a new dataset where I have measurements but no
positions, I might use k-nearest neighbors.
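To make the voting concrete, here is a minimal sketch of k-nearest neighbors in plain Python. The player measurements and position labels are made-up numbers, and `knn_predict` is a hypothetical helper name, not a real library function:

```python
from collections import Counter
import math

def knn_predict(labeled_points, new_point, k=3):
    """Label new_point by majority vote of its k nearest labeled neighbors."""
    # Sort labeled points by Euclidean distance to the new point, keep the k closest
    neighbors = sorted(labeled_points, key=lambda p: math.dist(p[0], new_point))[:k]
    # Majority vote among the k nearest neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: (height_cm, weight_kg) -> position (invented for illustration)
players = [
    ((185, 80), "goalkeeper"),
    ((188, 85), "goalkeeper"),
    ((170, 65), "midfielder"),
    ((172, 68), "midfielder"),
    ((168, 64), "midfielder"),
]

print(knn_predict(players, (171, 66), k=3))  # prints "midfielder"
```

The three players closest to the new measurements are all midfielders, so the vote is unanimous and the new player is labeled "midfielder".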
K-means:
The k-means algorithm is a clustering algorithm, and it is unsupervised: it takes a set of unlabeled points and tries to group them into clusters. The "k" is the number of clusters.
It is unsupervised
because the points have no external classification.
The k in k-means is
the number of clusters I want to have in the end. If k = 5, I will have 5
clusters, or distinct groups, of soccer players after I run the algorithm on my
dataset.
For example, if I have a dataset of soccer players who need to be grouped into k distinct groups based on similarity, I might use k-means.
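The grouping step can be sketched with the classic Lloyd's iteration: assign each point to its nearest center, then move each center to the mean of its assigned points. The measurements below are made-up numbers, and `kmeans` is a hypothetical helper, not a real library function:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Group unlabeled points into k clusters (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return centers, clusters

# Toy data: (height_cm, weight_kg) measurements forming two obvious groups
points = [(170, 65), (172, 68), (168, 64), (186, 82), (188, 85), (185, 80)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # prints [3, 3]
```

Note that the algorithm never sees any labels: it recovers the two groups purely from how close the measurements are to each other.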
Correspondingly, the k in each case means something different! In k-nearest neighbors, the k represents the number of neighbors who get a vote in determining a new player's position. Take the example where k = 4: if I have a new soccer player who needs a position, I take the 4 players in my dataset whose measurements are closest to the new player's, and I have them vote on the position I should assign.
In summary, they are two different algorithms with two very different end results, but the fact that they both use k can be very confusing!