Classification, Decision Trees, and k-Nearest-Neighbors

Follow this tutorial to learn how to implement decision trees. These materials also review k-NN and discuss the pros and cons of each algorithm.

Nearest Neighbors Method

The nearest neighbors method (k-Nearest Neighbors, or k-NN) is another very popular classification method that is also sometimes used in regression problems. Like decision trees, it is one of the most comprehensible approaches to classification. The underlying intuition is that you look like your neighbors. More formally, the method follows the compactness hypothesis: if the distance between examples is measured well enough, then similar examples are much more likely to belong to the same class.

According to the nearest neighbors method, the green ball would be classified as "blue" rather than "red".


For another example, if you do not know how to tag a Bluetooth headset on an online listing, you can find 5 similar headsets, and if 4 of them are tagged as "accessories" and only 1 as "Technology", then you will also label it under "accessories".

To classify each sample from the test set, one needs to perform the following operations in order:

  1. Calculate the distance to each of the samples in the training set.
  2. Select the $k$ samples from the training set with the smallest distances to the test sample.
  3. The class of the test sample will be the most frequent class among those $k$ nearest neighbors.

The method adapts quite easily to regression problems: in step 3, it returns not a class but a number, namely the mean (or median) of the target variable among the neighbors.
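
To make these steps concrete, below is a minimal sketch of the procedure in plain NumPy (the helper function knn_predict and the toy data are invented for illustration; for real work, use the scikit-learn class described later):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        """Classify a single test sample with vanilla k-NN and Euclidean distance."""
        # Step 1: distance from the test sample to every training sample
        distances = np.linalg.norm(X_train - x_test, axis=1)
        # Step 2: indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        # Step 3: majority vote among the k nearest neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # Toy data: two features, two classes
    X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
    y_train = np.array(["blue", "blue", "red", "red"])

    print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> "blue"

Replacing the majority vote in step 3 with np.mean(y_train[nearest]) turns the same sketch into a k-NN regressor.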

A notable feature of this approach is its laziness – calculations are only done during the prediction phase when a test sample needs to be classified. No model is constructed from the training examples beforehand. In contrast, recall that for decision trees in the first half of this article, the tree is constructed based on the training set, and the classification of test cases occurs relatively quickly by traversing through the tree.

Nearest neighbors are a well-studied approach. There exist many important theorems claiming that, on "endless" datasets, it is the optimal method of classification. The authors of the classic book "The Elements of Statistical Learning" consider k-NN to be a theoretically ideal algorithm whose usage is only limited by computation power and the curse of dimensionality.


Nearest Neighbors Method in Real Applications

  • k-NN can serve as a good starting point (baseline) in some cases;
  • In Kaggle competitions, k-NN is often used for the construction of meta-features (i.e. k-NN predictions as input to other models) or for stacking/blending; a small sketch of the meta-feature idea follows this list;
  • The nearest neighbors method extends to other tasks like recommendation systems. The initial decision could be a recommendation of a product (or service) that is popular among the closest neighbors of the person for whom we want to make a recommendation;
  • In practice, on large datasets, approximate methods of nearest neighbor search are often used. There are a number of open source libraries that implement such algorithms; check out Spotify's library, Annoy.
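
As a rough sketch of the meta-feature idea mentioned above (the dataset and model choices here are arbitrary examples, not recommendations), out-of-fold k-NN predictions can be generated with scikit-learn's cross_val_predict and appended as an extra feature for a second-level model:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Out-of-fold probability of the positive class, predicted by a k-NN model
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
    knn_meta = cross_val_predict(knn, X, y, cv=5, method="predict_proba")[:, 1]

    # Append the k-NN prediction as an extra feature for a second-level model
    X_stacked = np.column_stack([X, knn_meta])
    second_level = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    second_level.fit(X_stacked, y)

Using out-of-fold predictions, rather than predictions on the same data the k-NN was fit on, helps keep the meta-feature from leaking the target.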

The quality of classification/regression with k-NN depends on several parameters:

  • The number of neighbors $k$.
  • The distance measure between samples (common ones include Hamming, Euclidean, cosine, and Minkowski distances). Note that most of these metrics require data to be scaled. Simply speaking, we do not want the "salary" feature, which is on the order of thousands, to affect the distance more than "age", which is generally less than 100 (see the scaling sketch after this list).
  • Weights of neighbors (each neighbor may contribute a different weight; for example, the further away the sample, the lower its weight).
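
To illustrate the scaling point (the dataset and the choice of $k$ here are just examples), a quick comparison on a dataset with very differently scaled features shows why standardization matters for distance-based methods:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)

    # Without scaling, features with large ranges (e.g. proline) dominate the Euclidean distance
    raw_knn = KNeighborsClassifier(n_neighbors=5)
    scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

    print("raw accuracy:   ", cross_val_score(raw_knn, X, y, cv=5).mean())
    print("scaled accuracy:", cross_val_score(scaled_knn, X, y, cv=5).mean())

On datasets like this one, the scaled pipeline usually scores noticeably higher, simply because no single feature dominates the distance.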

Class KNeighborsClassifier in Scikit-learn

The main parameters of the class sklearn.neighbors.KNeighborsClassifier are:

  • weights: uniform (all weights are equal), distance (the weight is inversely proportional to the distance from the test sample), or any other user-defined function;
  • algorithm (optional): brute, ball_tree, kd_tree, or auto. In the first case, the nearest neighbors for each test case are found by an exhaustive (brute-force) search over the training set. In the second and third cases, the training examples are stored in a tree to accelerate finding the nearest neighbors. If you set this parameter to auto, the appropriate way to find the neighbors will be automatically chosen based on the training set.
  • leaf_size (optional): the threshold for switching to brute-force search if the algorithm for finding neighbors is BallTree or KDTree;
  • metric: minkowski, manhattan, euclidean, chebyshev, or other.
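
A minimal usage sketch that ties these parameters together (the specific values, such as n_neighbors=10 and weights="distance", are illustrative choices, not recommendations):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

    knn = KNeighborsClassifier(
        n_neighbors=10,      # the k in k-NN
        weights="distance",  # closer neighbors get larger weights
        algorithm="auto",    # let scikit-learn pick brute, ball_tree, or kd_tree
        leaf_size=30,        # leaf size for the tree-based search structures
        metric="minkowski",  # with p=2 this is the Euclidean distance
        p=2,
    )
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))  # accuracy on the hold-out set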