CS250 Study Guide

Unit 7: Data Mining I – Supervised Learning

7a. Explain supervised learning techniques

  • What is supervised learning?
  • How are distance functions used in learning systems?
  • What is overfitting versus underfitting?
  • What is the main goal of supervised learning?

Supervised learning attempts, using training data, to create a mapping between a set of input data and a set of desired output targets arranged in the form of training pairs. A spectrum of techniques exists, ranging from statistical techniques such as regression, to classical techniques such as Bayes' decision rule and k-nearest neighbors, to deep learning neural networks. While the field is vast, some general concepts are common to learning systems. For instance, it is often important to measure the distance between points using a distance function such as the Euclidean distance or the Manhattan distance. Such functions are useful in the classification problem, where a data point can be classified by finding the class to which it lies at minimum distance.
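To make the idea of minimum-distance classification concrete, here is a minimal sketch using numpy; the training points, labels, and query point are made up for illustration, and the same logic works with the Manhattan distance by swapping the distance function.

    # A minimal sketch of minimum-distance (1-nearest-neighbor) classification.
    # The points and class labels below are hypothetical.
    import numpy as np

    def euclidean(a, b):
        """Euclidean (L2) distance between two feature vectors."""
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
        """Manhattan (L1) distance between two feature vectors."""
        return np.sum(np.abs(a - b))

    # Two labeled training points per class (illustrative data).
    X_train = np.array([[1.0, 1.0], [1.5, 2.0],   # class 0
                        [5.0, 5.0], [6.0, 4.5]])  # class 1
    y_train = np.array([0, 0, 1, 1])

    x_new = np.array([2.0, 1.5])  # point to classify

    # Assign the label of the closest training point.
    distances = [euclidean(x_new, x) for x in X_train]
    predicted_class = y_train[int(np.argmin(distances))]
    print(predicted_class)  # expected: 0, since x_new lies near the class-0 points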

In general, the main goal of supervised learning is to fit a specific model using the training data. Different supervised learning algorithms attempt to achieve this goal in different ways. The k-nearest-neighbors algorithm achieves it by applying a distance function to the training points. Logistic regression achieves it by passing a linear combination of the inputs through a logistic function. Decision trees achieve it by building a tree of rules based on the distribution of the training data. During the training process, one must avoid underfitting and overfitting the model. While this topic is expounded upon in a later module, it is important at the outset of any machine learning lesson to have a qualitative grasp of these concepts. Underfitting occurs when a mathematical model cannot adequately capture the underlying structure of the data. Overfitting occurs when the learning model fits the training data so closely that it cannot generalize to points outside the training set. Finally, it is important to be aware of the bias-variance tradeoff. Bias is the difference between the average prediction of a model and the correct value the model is trying to predict. Variance is the variability of a model's prediction for a given data point when the model is trained over different samples of training sets. A reliable supervised learning algorithm should strike an optimal balance between these two sources of error.
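One way to see underfitting and overfitting qualitatively is to compare training and test accuracy as a model's capacity grows. The rough sketch below uses a synthetic dataset and a decision tree whose depth is either heavily restricted or unrestricted; the exact numbers will vary from run to run, but the typical pattern is low accuracy everywhere for the shallow tree (underfitting) and near-perfect training accuracy with noticeably lower test accuracy for the deep tree (overfitting).

    # A rough illustration of underfitting vs. overfitting on a synthetic dataset.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in [1, None]:  # depth 1 tends to underfit; unlimited depth tends to overfit
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        print(depth,
              round(model.score(X_train, y_train), 3),  # training accuracy
              round(model.score(X_test, y_test), 3))    # test accuracy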

To review, see Data Mining Overview and Supervised Learning.

 

7b. Apply methods in the scikit-learn module to supervised learning 

  • What are the syntax details for implementing k-nearest neighbors?
  • What are the syntax details for implementing decision trees?
  • What are the syntax details for implementing logistic regression?

This course focuses on Python implementations of decision trees, logistic regression, and k-nearest neighbors to help you begin your journey in machine learning using the scikit-learn module. Common to many Python machine learning modules are the fit, predict, and score methods. The fit method fits a model based on the technique that has been instantiated. When applying supervised learning, the score method quantitatively compares a supervised test set against the model's prediction using, for example, the mean accuracy for a classifier or the coefficient of determination for a regressor. The predict method yields the model output for a specific set of inputs. The train_test_split method is extremely useful for creating training and test sets from a larger dataset.
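The following is a minimal sketch of that workflow, using the iris dataset bundled with scikit-learn and a k-nearest-neighbors classifier as the example model; the choice of dataset, test-set fraction, and number of neighbors are illustrative.

    # A minimal sketch of the train_test_split / fit / predict / score workflow.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)          # fit the model to the training pairs
    print(model.predict(X_test[:5]))     # model output for the first five test inputs
    print(model.score(X_test, y_test))   # mean accuracy on the test set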

When it comes to instantiations such as linear_model.LogisticRegression, neighbors.KNeighborsClassifier, and tree.DecisionTreeClassifier, you must be very clear about how to fit these models. While the fit syntax is consistent across such models, their input parameters naturally vary. For example, the input parameter criterion, which selects the measure used to evaluate splits, is specific to the decision tree, and the input parameter p for choosing the distance metric is specific to the k-nearest-neighbors algorithm. Take some time to review the respective input parameters and output attributes.
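The sketch below shows the three instantiations side by side; the parameter values are illustrative (p=1 selects the Manhattan metric for k-nearest neighbors, and criterion="entropy" changes how the tree evaluates splits), while the fit call itself is identical for all three.

    # Different input parameters, same fit syntax; parameter values are illustrative.
    from sklearn import linear_model, neighbors, tree
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    models = [
        linear_model.LogisticRegression(max_iter=1000),       # allow more solver iterations
        neighbors.KNeighborsClassifier(n_neighbors=3, p=1),   # p=1: Manhattan metric
        tree.DecisionTreeClassifier(criterion="entropy"),     # tree-specific split criterion
    ]

    for model in models:
        model.fit(X, y)  # identical fit call regardless of the underlying technique
        print(type(model).__name__, round(model.score(X, y), 3))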

To review, see k-Nearest Neighbors, Decision Trees, and Logistic Regression.

 

7c. Implement Python scripts that extract features and reduce feature dimension 

  • What is feature extraction?
  • Why is dimensionality reduction useful?
  • What preprocessing steps can be helpful to the feature extraction process?

After data is collected for a given data science application, the dataset is usually structured as a set of observations, each described by a set of variables, or 'features', arrived at through a series of processing steps. Once a set of features is in place, the dataset is often subjected to preprocessing steps such as feature scaling or feature normalization so that each feature carries equivalent weight. For example, instantiating preprocessing.MinMaxScaler enables the scaling of all feature magnitudes into a minimum-to-maximum range (where the default range is between zero and one), and preprocessing.StandardScaler can be used to normalize features to zero mean and unit variance. The process of taking raw data and converting it into a set of scaled or normalized features is known as feature extraction.
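Here is a short sketch of both preprocessing steps; the small array is hypothetical, chosen so that the second feature is several hundred times larger than the first and would otherwise dominate any distance calculation.

    # Feature scaling and normalization with scikit-learn; the data are illustrative.
    import numpy as np
    from sklearn import preprocessing

    X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

    # Scale each feature into the default [0, 1] range.
    X_minmax = preprocessing.MinMaxScaler().fit_transform(X)

    # Normalize each feature to zero mean and unit variance.
    X_standard = preprocessing.StandardScaler().fit_transform(X)

    print(X_minmax)
    print(X_standard)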

When a large number of features (say, more than 10) is derived from a dataset, it is important to consider techniques that can reduce the dimensionality of the data. This is partly due to what some term the curse of dimensionality, where, as the dimension of the data (that is, the number of features) increases, all data points begin to appear equidistant from one another. In other words, the concept of a distance function for the classification problem is rendered ineffective in higher dimensions. This motivates methods for dimensionality reduction such as principal component analysis (PCA), which can be implemented in Python by invoking decomposition.PCA. PCA is an eigenvector decomposition of the dataset's covariance structure that determines which components account for the most variation in the data. Therefore, you must understand output attributes such as explained_variance_ratio_ and singular_values_, which describe each eigenvector component's 'strength', or contribution, relative to the original feature set.
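The sketch below reduces the four iris features to two principal components and prints the attributes just mentioned; the dataset and the choice of two components are illustrative.

    # A minimal sketch of dimensionality reduction with decomposition.PCA.
    from sklearn import decomposition
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)

    pca = decomposition.PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                 # (150, 2): two principal components kept
    print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
    print(pca.singular_values_)            # relative 'strength' of each component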

To review, see Principal Component Analysis.

 

7d. Train and evaluate models using data mining techniques 

  • What methods and techniques are applied when training and evaluating learning models?
  • What method can compute the mean accuracy of a model response to a test set?
  • How is the argmax calculation applied in pattern classification problems?

The training and evaluation of models using the scikit-learn module are generally accomplished using the train_test_split, fit, predict, and score methods. Visualization techniques such as scatter, distribution, matrix, and regression plots (using, for example, matplotlib and seaborn) can also be of great use when evaluating classification performance. The score method computes the mean accuracy of a model's response to a test set. If only the model output is desired, then the predict method is more appropriate.
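A possible sketch of the predict/score distinction together with a simple scatter-plot visualization is shown below; it assumes the iris data and a k-nearest-neighbors model, as in the earlier examples, and plots only the first two features for readability.

    # Evaluating a classifier with score, predict, and a matplotlib scatter plot.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = KNeighborsClassifier().fit(X_train, y_train)
    y_pred = model.predict(X_test)          # model output only
    print(model.score(X_test, y_test))      # mean accuracy against the true test labels

    # Scatter plot of the first two features, colored by predicted class.
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
    plt.xlabel("sepal length (cm)")
    plt.ylabel("sepal width (cm)")
    plt.show()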

When it comes to training and evaluating models, it is helpful to have some intuition regarding the output of a given technique. For example, for a method such as k-nearest neighbors, you should be able to predict the classifier output for a two-class problem on the real line with a small set of observations. It is also helpful to be familiar with the numpy argmax method for determining the index at which a maximum value occurs within a numerical list or vector. In classification problems, one can be equally concerned with the argmax location and the actual maximum value, which can serve as a confidence measure. Finally, it is often the case that one may want to optimally tune a parameter, or a combination of parameters, for a given training set. A search of this kind can be carried out using the GridSearchCV method.
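The following sketch combines the two ideas: a cross-validated grid search over the number of neighbors, and an argmax over predicted class probabilities for a single test point. The dataset and the parameter grid are illustrative.

    # GridSearchCV for parameter tuning, and np.argmax over class probabilities.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Tune the number of neighbors with a cross-validated grid search.
    search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]})
    search.fit(X_train, y_train)
    print(search.best_params_)

    # For one test point, argmax gives the predicted class index, and the
    # maximum probability itself can serve as a rough confidence measure.
    probs = search.predict_proba(X_test[:1])[0]
    print(np.argmax(probs), probs.max())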

To review, see Training and Testing.

 

Unit 7 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • argmax
  • Bayes' decision
  • bias-variance tradeoff
  • classification problem
  • decision tree
  • decomposition.PCA
  • dimensionality reduction
  • distance function
  • Euclidean distance
  • explained_variance_ratio_
  • feature extraction
  • feature normalization
  • feature scaling
  • fit
  • GridSearchCV
  • k-nearest neighbors
  • linear_model.LogisticRegression
  • logistic regression
  • Manhattan distance
  • minimum distance
  • neighbors.KNeighborsClassifier
  • overfitting
  • predict
  • preprocessing.MinMaxScaler
  • preprocessing.StandardScaler
  • principal component analysis (PCA)
  • score
  • singular_values_
  • supervised learning
  • train_test_split
  • training pairs
  • tree.DecisionTreeClassifier
  • underfitting