Classification, Decision Trees, and k-Nearest-Neighbors: Introduction

Introduction

Before we dive into the material for this week's article, let's talk about the kind of problem that we are going to solve and its place in the exciting field of machine learning. T. Mitchell's book "Machine Learning" (1997) gives a classic, general definition of machine learning as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

In the various problem settings, T, P, and E can refer to completely different things. Some of the most popular tasks T in machine learning are the following:

classification of an instance to one of the categories based on its features;
regression – prediction of a numerical target feature based on other features of an instance;
clustering – identifying partitions of instances based on the features of these instances so that the members within the groups are more similar to each other than those in the other groups;
anomaly detection – search for instances that are "greatly dissimilar" to the rest of the sample or to some group of instances;
and so many more.

Experience E refers to data (we can't go anywhere without it). Machine learning algorithms can be divided into those that are trained in a supervised or unsupervised manner. In unsupervised learning tasks, one has a set consisting of instances described by a set of features. In supervised learning problems, there's also a target variable, which is what we would like to be able to predict, known for each instance in a training set.

Example

Classification and regression are supervised learning problems. For example, as a credit institution, we may want to predict loan defaults based on the data accumulated about our clients. Here, the experience E is the available training data: a set of instances (clients), a collection of features (such as age, salary, type of loan, past loan defaults, etc.) for each, and a target variable (whether they defaulted on the loan). This target variable is just a fact of a loan default (1 or 0), so recall that this is a (binary) classification problem. If you were instead predicting by how much time the loan payment is overdue, this would become a regression problem.

Finally, the third term used in the definition of machine learning is a metric of the algorithm's performance evaluation P. Such metrics differ for various problems and algorithms, and we'll discuss them as we study new algorithms. For now, we'll refer to a simple metric for classification algorithms, the proportion of correct answers – accuracy – on the test set.

Let's take a look at two supervised learning problems: classification and regression

Source: Yury Kashnitsky, https://web.archive.org/web/20210922024556/https://mlcourse.ai/articles/topic3-dt-knn/
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.

Course Introduction

Course Syllabus

Unit 1: What is Data Science?

1.1: Introduction to Data Science

A History of Data Science

Understanding Data Science

1.2: How Data Science Works

How Data Science Works

The Data Science Pipeline

The Data Science Lifecycle

1.3: Important Facets of Data Science

Data Scientist Archetypes

What is the Field of Data Science?

Thinking about the World

Unit 1 Assessment

Unit 1 Assessment

Unit 2: Python for Data Science

2.1: Google Colaboratory

Introduction to Google Colab

2.2: Datatypes, Operators, and the math Module

Data Types in Python

Operators and the math Module

2.3: Control Statements, Loops, and Functions

Functions, Loops, and Logic

Functions and Control Structures

2.4: Lists, Tuples, Sets, and Dictionaries

Data Structures in Python

Sets, Tuples, and Dictionaries

Examples of Sets, Tuples, and Dictionaries

2.5: The random Module

Python's random Module

2.6: The matplotlib Module

Visualization and matplotlib

Precision Data Plotting with matplotlib

Unit 2 Assessment

Unit 2 Assessment

Unit 3: The numpy Module

3.1: Constructing Arrays

Using Matrices

Creating numpy Arrays

numpy Fundamentals

numpy for Numerical and Scientific Computing

3.2: Indexing

numpy Arrays and Vectorized Programming

Advanced Indexing with numpy

3.3: Array Operations

A Visual Intro to numpy and Data Representation

Mathematical Operations with numpy

numpy with matplotlib

3.4: Saving and Loading Data

Storing Data in Files

Load Compressed Data using numpy.load

Saving a Compressed File with numpy

".npy" versus ".npz" Files

Unit 3 Assessment

Unit 3 Assessment

Unit 4: Applied Statistics in Python

4.1: Basic Statistical Measures and Distributions

Applying Statistics

Key Statistical Terms

Descriptive Statistics

Basic Probability

Distribution and Standard Deviation

Continuous Probability Functions and the Uniform Distribution

The Normal Distribution

Confidence Intervals

Hypothesis Testing

Linear Regression

4.2: Random Numbers in numpy

Using numpy

Random Number Generation

Using np.random.normal

A Data Science Example

4.3: The scipy.stats Module

Descriptive Statistics in Python

Statistical Modeling with scipy

Probability Distributions and their Stories

4.4: Data Science Applications

Statistics and Random Numbers

Statistics in Python