Data mining attempts to find patterns and relationships within and between given data sets. The field of data mining is vast, so we have broken down its introduction into two units: supervised and unsupervised learning. We will then move on to statistical model-building. When you finish this unit, you will be able to implement learning systems fundamental to the field of data mining.
This unit discusses the basics of supervised learning, feature extraction, dimensionality reduction, and the training and testing of supervised learning models. We will focus on benchmark models fundamental to data mining, such as Bayes' decision theory and the k-nearest neighbor algorithm, and we will implement them using the scikit-learn module. Understanding these methods will prepare you for future excursions into machine learning and deep learning.
Completing this unit should take you approximately 11 hours.
Data mining involves various algorithms and techniques for database searching, inferring data relationships, pattern recognition, and pattern classification. Pattern recognition is the process of comparing a sample observation against a fixed set of patterns (like those stored in a database) to search for an optimal match. Face recognition, voice recognition, character recognition, fingerprint recognition, and text string matching are all examples of pattern searching and pattern recognition.
Going one step further, given a set of prescribed pattern classes, pattern classification is the process of associating a sample observation with one of the pattern classes. For example, consider a database containing two possible classes of face images: happy faces and sad faces. Pattern classification involves processing an input face image of unknown classification to optimally classify it as either happy or sad. As you will soon see, the optimal pattern match or pattern classification is often defined using probabilistic and statistical measures such as distances, deviations, and confidence intervals.
Machine learning is the aspect of data mining that applies algorithms for learning and inferring relationships within empirical data sets. Since machine learning often involves pattern searching and classification, it is a broad subject that encompasses several approaches for constructing data learning and inference models.
Pattern search and classification problems often involve observing data subject to some set of conditions. Study the relationship between conditional probability and Bayes' Theorem, as it is foundational material for data mining.
Here are more examples of applying Bayes' Theorem and conditional probability to data mining.
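To make Bayes' Theorem concrete before moving on, here is a minimal numeric sketch in Python. The spam-filtering scenario and its probabilities are purely illustrative, not drawn from any dataset:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Illustrative numbers only: suppose 1% of emails are spam (prior),
# 90% of spam emails contain the word "offer" (likelihood), and
# 5% of non-spam emails also contain it.

p_spam = 0.01             # P(spam)
p_word_given_spam = 0.90  # P("offer" | spam)
p_word_given_ham = 0.05   # P("offer" | not spam)

# Total probability of seeing the word: P("offer")
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: P(spam | "offer")
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'offer') = {p_spam_given_word:.3f}")  # roughly 0.154
```

Notice that even though the word appears in 90% of spam, the low prior probability of spam keeps the posterior modest. This interplay between priors and likelihoods is central to the decision rule discussed next.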
At the heart of all pattern search or classification problems (either explicitly or implicitly) lies Bayes' Decision Theory. Bayes' decision simply says, given an input observation of unknown classification, make the decision that will minimize the probability of a classification error. For example, in this unit, you will be introduced to the k-nearest neighbor algorithm. It can be demonstrated that this algorithm can make Bayes' decision. Read this chapter to familiarize yourself with Bayes' decision.
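As a preview of how the decision rule operates, here is a minimal sketch assuming two classes with known (and here entirely made-up) Gaussian class-conditional densities and priors. Choosing the class with the largest posterior probability minimizes the probability of a classification error:

```python
import math

# Illustrative two-class problem with assumed (made-up) Gaussian
# class-conditional densities and prior probabilities.
priors = {"happy": 0.6, "sad": 0.4}
means = {"happy": 2.0, "sad": -1.0}
stds = {"happy": 1.0, "sad": 1.5}

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def bayes_decision(x):
    """Pick the class with the largest posterior P(class | x).

    P(x) is the same for every class, so comparing
    P(x | class) * P(class) is equivalent to comparing posteriors.
    """
    scores = {c: gaussian_pdf(x, means[c], stds[c]) * priors[c]
              for c in priors}
    return max(scores, key=scores.get)

print(bayes_decision(1.2))   # -> happy
print(bayes_decision(-2.0))  # -> sad
```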
A set, collection, or database of either pattern or class data is generically referred to as "training data". This is because data mining requires a collection of known or learned examples against which input observations can be compared. For pattern classification, as mentioned in the previous section, there are two broad categories of learning: supervised and unsupervised. This unit deals specifically with supervised learning techniques, while the next unit deals with unsupervised learning techniques. Read these basic steps for solving a supervised learning problem. Assuming data has been collected, as this unit progresses you will come to understand and be able to implement each stage of the process:
Training set → Feature selection → Training algorithm → Evaluate model
Our tool for implementing these steps will be the scikit-learn module.
Once a set of features is chosen, a model must be trained and evaluated. Based on these materials, you should now understand how data mining works. The rest of this unit will introduce some practical techniques and their implementations using scikit-learn.
The scikit-learn module contains a broad set of methods for statistical analyses and basic machine learning. During the remainder of this unit and the next on unsupervised learning, we will introduce scikit-learn in the context of data mining applications. Use this section as an introduction to see how modules such as pandas can be used in conjunction with the scikit-learn module. Make sure to follow along with the programming examples. There is no substitute for learning by doing. As this course progresses, you will understand more deeply how to apply the methods used in this video.
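If you want a quick taste before the video, the following minimal sketch shows the typical hand-off: pandas holds and prepares the tabular data, and a scikit-learn estimator consumes it. The column names and values here are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; in practice you might load a file,
# e.g. df = pd.read_csv("measurements.csv").
df = pd.DataFrame({
    "height_cm": [160.0, 172.0, 181.0, 168.0],
    "weight_kg": [55.0, 70.0, 85.0, 62.0],
})

# scikit-learn estimators accept pandas DataFrames directly;
# fit_transform returns the standardized feature matrix.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

print(pd.DataFrame(X_scaled, columns=df.columns).round(2))
```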
As part of the feature optimization process, when faced with a large set of features, a major goal is to determine a combination or mixture of features that leads to optimal model evaluations. You have already seen the subset selection approach, which you can use to reduce the number of features used to describe training set observations. Using the language of vectors and matrices, we can say that we can reduce the dimension of a feature vector if a subset of features is found to give optimal results. Reducing the feature vector dimension is preferable because it directly translates into reduced time to train a given model. Additionally, higher-dimensional spaces impede the ability to define meaningful distances, as all points in the space begin to appear equally close together (the so-called "curse of dimensionality").
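The following small experiment, using uniform random points purely for illustration, shows this effect: as the dimension grows, the nearest and farthest neighbors of a query point become nearly indistinguishable in distance:

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimension grows, the gap between the nearest and farthest
# neighbor of a query point shrinks relative to the distances
# themselves -- one face of the "curse of dimensionality".
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:5d}  nearest/farthest = {dists.min() / dists.max():.3f}")
```

This is one reason reducing dimensionality often improves distance-based methods such as k-nearest neighbor.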
Many approaches exist for reducing the dimension of feature vectors while still optimizing model evaluations. The subset selection approach is very useful and regularly applied. On the other hand, this approach may not reveal underlying relationships between the features or describe why certain features work well together while others do not. To do this, it is necessary to develop algorithms and computational recipes for mixing the most relevant features. Principal Component Analysis (PCA) is arguably one of the most popular methodologies for achieving this goal.
In this section, you will learn how to implement and apply PCA for feature optimization and dimensionality reduction using scikit-learn.
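As a preview, here is a minimal sketch of that workflow using synthetic, correlated data; the dataset itself is fabricated purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Illustrative data: 4 features on 200 samples, where two pairs
# of features are nearly redundant copies of each other.
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0],
                     base[:, 0] + 0.1 * rng.normal(size=200),
                     base[:, 1],
                     base[:, 1] + 0.1 * rng.normal(size=200)])

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4-dimensional feature vectors onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # most variance in 2 components
```

The explained variance ratio shows that two components capture most of the variation, which is why the redundant features can be folded together without losing much information.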
The k-nearest neighbor (k-NN) algorithm attempts to classify an input feature vector by finding its k closest neighbors among a set of labeled training examples and assigning the class held by the majority of those neighbors. Using the word "closest" automatically means that you must choose some measure of distance to decide class membership.
With your current understanding, it is time to implement the k-NN algorithm using scikit-learn. Follow along with this example to gain programming experience.
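For a compact reference implementation to keep alongside the example, here is a minimal sketch using scikit-learn's built-in iris dataset; the choices of k=5 and a 70/30 split are arbitrary and purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A standard benchmark dataset: 150 iris flowers,
# 4 features each, 3 species (classes).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Classify each test point by a majority vote among its
# k=5 nearest training points (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(f"test accuracy: {knn.score(X_test, y_test):.3f}")
```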
Study this example in depth. Notice it uses the same dataset as the previous example; however, the approach to building the training and test sets differs. It is important to see different perspectives on solving the same problem.
A decision tree is a model of decisions and their outcomes. It has found widespread application because of its ease of implementation. Additionally, its compact tree representation is often useful for visualizing the breadth of possible outcomes.
Follow this tutorial to learn how to implement the decision tree. These materials also conveniently review k-NN and discuss the pros and cons of each algorithm.
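For a compact reference point while you work through the tutorial, here is a minimal decision tree sketch on the iris dataset; the depth limit of 3 is an arbitrary illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Limiting the depth keeps the tree small and easy to read,
# and helps guard against overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
# Print the learned decision rules as plain text.
print(export_text(tree, feature_names=load_iris().feature_names))
```

The printed rules illustrate why decision trees are prized for visualization: each path from root to leaf is a readable if/then decision.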
It is always best to see different programming examples when learning a new topic. There is no substitute for practice.
Logistic regression is a nonlinear modification of linear regression: the output of a linear model is passed through the logistic (sigmoid) function, 1 / (1 + e^(-z)), which squashes it into the interval (0, 1) so it can be interpreted as a class probability. The logistic function often arises in machine learning applications.
Here is an introductory example of how to apply scikit-learn to implement logistic regression. As you follow this programming example, make sure you understand how the variable definitions relate to the algorithm.
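As a minimal companion sketch, the following fits a logistic regression classifier on scikit-learn's built-in breast cancer dataset; the scaling step and the split fraction are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A binary benchmark: classify tumors as malignant or benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The classifier maps the linear score w.x + b through the logistic
# function 1 / (1 + exp(-z)) to obtain a class probability.
# Standardizing the features first helps the solver converge.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print(f"test accuracy: {model.score(X_test, y_test):.3f}")
print("class probabilities for one sample:",
      model.predict_proba(X_test[:1]).round(3))
```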
This video gives an example of implementing logistic regression. Given the k-NN, decision tree, and logistic regression classifiers, you should begin to see a common theme emerging: the supervised learning pipeline. In the next section, we will complete the pipeline by exercising model evaluation using the techniques we've discussed.
A final step in the supervised learning process is to evaluate a trained model using data not contained within the training set. Use this video to practice programming examples involving training and testing.
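To practice alongside the video, here is a minimal evaluation sketch combining cross-validation with a per-class report on a held-out split; the dataset, fold count, and split fraction are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: each fold is held out once for testing
# while the model trains on the remaining four.
scores = cross_val_score(knn, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f}")

# A single train/test split gives a per-class breakdown
# on data the model never saw during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```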
Use this video to practice the concepts presented in this unit. This material is crucial as it combines all the steps outlined in the supervised learning section so that they can be implemented using scikit-learn. We will cover implementing linear regression in an upcoming unit. For now, use these examples to learn how to implement and evaluate the different machine learning approaches covered in this unit.
Take this assessment to see how well you understood this unit.