7.3: Principal Component Analysis
As part of the feature optimization process, when faced with a large set of features, a major goal is to determine a combination or mixture of features that leads to optimal model evaluations. You have already seen the subset selection approach, which you can use to reduce the number of features used to describe training set observations. In the language of vectors and matrices, this means we can reduce the dimension of a feature vector if a subset of features is found to give optimal results. Reducing the feature vector dimension is preferable because it directly translates into reduced training time for a given model. Additionally, higher-dimensional spaces impede the ability to define meaningful distances, as all points in the space begin to appear nearly equidistant from one another.
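The distance concentration effect mentioned above can be illustrated with a small simulation. The sketch below (an illustrative assumption, not an example from the text) draws random points in the unit hypercube and compares the relative spread of pairwise distances as the dimension grows; as the spread shrinks, "near" and "far" neighbors become harder to tell apart.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Illustrative sketch: as dimension grows, pairwise distances between
# random points concentrate around their mean, so all points start to
# look roughly equidistant from one another.
for d in (2, 100, 10_000):
    points = rng.random((100, d))            # 100 random points in [0, 1]^d
    dists = pdist(points)                    # all unique pairwise distances
    # Relative spread (coefficient of variation) shrinks with dimension.
    print(f"dim={d:>6}  relative spread = {dists.std() / dists.mean():.3f}")
```

Running this shows the relative spread dropping by orders of magnitude between 2 and 10,000 dimensions, which is why dimensionality reduction matters for distance-based methods.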
Many approaches exist for reducing the dimension of feature vectors while still optimizing model evaluations. The subset selection approach is very useful and regularly applied. However, it may not reveal underlying relationships between the features or explain why certain features work well together while others do not. To do this, it is necessary to develop algorithms and concrete recipes for mixing the most relevant features. Principal Component Analysis (PCA) is arguably one of the most popular methodologies for achieving this goal.
In this section, you will learn how to implement and apply PCA for feature optimization and dimensionality reduction using scikit-learn.
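As a first taste of the scikit-learn workflow, the minimal sketch below reduces the four-dimensional iris feature vectors to two principal components. The dataset choice and `n_components=2` are illustrative assumptions, not values prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small, well-known dataset: 150 observations, 4 features each.
X = load_iris().data                      # shape (150, 4)

# Fit PCA and project the data onto the top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # shape (150, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)      # fraction of variance per component
```

The `explained_variance_ratio_` attribute reports how much of the original variance each retained component captures, which is the usual guide for choosing how far the feature vector can be shrunk.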