Classification, Decision Trees, and k-Nearest-Neighbors

Follow this tutorial to learn how to implement the decision tree. These materials also conveniently review k-NN and discuss the pros and cons of each algorithm.

Introduction

Before we dive into the material for this week's article, let's talk about the kind of problem that we are going to solve and its place in the exciting field of machine learning. T. Mitchell's book "Machine Learning" (1997) gives a classic, general definition of machine learning as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

In the various problem settings, T, P, and E can refer to completely different things. Some of the most popular tasks T in machine learning are the following:

  • classification of an instance to one of the categories based on its features;
  • regression – prediction of a numerical target feature based on other features of an instance;
  • clustering – identifying partitions of instances based on the features of these instances so that the members within the groups are more similar to each other than those in the other groups;
  • anomaly detection – search for instances that are "greatly dissimilar" to the rest of the sample or to some group of instances;
  • and so many more.


Experience E refers to data (we can't go anywhere without it). Machine learning algorithms can be divided into those that are trained in a supervised or unsupervised manner. In unsupervised learning tasks, one has a set consisting of instances described by a set of features. In supervised learning problems, there's also a target variable, which is what we would like to be able to predict, known for each instance in a training set.


Example

Classification and regression are supervised learning problems. For example, as a credit institution, we may want to predict loan defaults based on the data accumulated about our clients. Here, the experience E is the available training data: a set of instances (clients), a collection of features (such as age, salary, type of loan, past loan defaults, etc.) for each, and a target variable (whether they defaulted on the loan). This target variable is just a fact of a loan default (1 or 0), so recall that this is a (binary) classification problem. If you were instead predicting by how much time the loan payment is overdue, this would become a regression problem.

Finally, the third term used in the definition of machine learning is a metric of the algorithm's performance evaluation P. Such metrics differ for various problems and algorithms, and we'll discuss them as we study new algorithms. For now, we'll refer to a simple metric for classification algorithms, the proportion of correct answers – accuracy – on the test set.

Let's take a look at two supervised learning problems: classification and regression


Source: Yury Kashnitsky, https://web.archive.org/web/20210922024556/https://mlcourse.ai/articles/topic3-dt-knn/
Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.