CS250 Study Guide

Unit 9: Data Mining III – Statistical Modeling

9a. Explain linear regression concepts 

  • What is linear regression?
  • What quantities are of importance when performing linear regression?
  • What are some characteristics of linear regression computations?

Linear regression is a supervised linear modeling technique for finding an optimal linear fit to a dataset. Simple linear regression relates one dependent variable to one independent variable by determining a slope and an intercept. Multiple linear regression involves more than one independent variable. An explicit intercept term is optional because the constant can be absorbed into the solution of the linear regression equations: the scikit-learn module computes the intercept by default, while the statsmodels module allows for either explicit intercept computation or absorbing the constant into the regression equations. The optimal fit is found by solving a least squares problem in which the sum of the squared errors is minimized.
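As a concrete sketch of these two conventions (using made-up data values), the snippet below fits the same simple regression with both modules; scikit-learn fits the intercept by default, while statsmodels absorbs the constant by adding an explicit column to the design matrix:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # scikit-learn computes the intercept by default (fit_intercept=True)
    sk_model = LinearRegression().fit(x.reshape(-1, 1), y)
    print(sk_model.coef_, sk_model.intercept_)

    # statsmodels absorbs the constant via a column of ones
    X = sm.add_constant(x)
    sm_model = sm.OLS(y, X).fit()
    print(sm_model.params)   # [intercept, slope]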

Important quantities derived from linear regressions are the residuals (the differences between the observed values and the estimates) and the correlation coefficient (which ranges from -1 to 1). A value near zero implies no correlation, so a linear model is unlikely to explain the data. A positive correlation coefficient implies the dependent variable increases as the independent variable increases; a negative correlation coefficient implies the dependent and independent variables are inversely related. For the purposes of this course, the coefficient of determination (also known as the R² score) is the square of the correlation coefficient. You should be aware of the equations for these quantities. Additionally, you should have some intuition regarding the trend of the data as it relates to the sign of the slope for simple linear regression. Lastly, for simple linear regression, you should know why the regression line passes through the point given by the mean values of the independent and dependent variables.
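These quantities are easy to compute directly. The sketch below (again with illustrative data) finds the residuals, the correlation coefficient r, and R² = r², and confirms that the fitted line passes through the point of mean values:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    slope, intercept = np.polyfit(x, y, 1)    # least squares fit
    residuals = y - (slope * x + intercept)   # observed minus estimated

    r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, between -1 and 1
    print(residuals)
    print(r, r**2)                # R² is the square of r for simple regression

    # the regression line passes through (mean(x), mean(y))
    print(np.isclose(slope * x.mean() + intercept, y.mean()))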

To review, see Linear Regression and Residuals.

 

9b. Apply the scikit-learn module to build linear regression models

  • What is the scikit-learn instantiation for linear regression?
  • What are important computational details for fitting a linear regression model?
  • What are key output attributes for implementing a scikit-learn linear regression?

The scikit-learn module offers the capacity to create linear regression models using the instantiation linear_model.LinearRegression. When fitting a model, it is important to be aware of the subtleties of how the data must be arranged. Since the data matrix for the independent variables is assumed to have dimensions (number of observations)-by-(number of variables), it must have two dimensions. In other words, even if there is only one independent variable, the data cannot be in the form of a numpy vector (that is, a single-subscript array). If the data is in the form of a vector, it must first be reshaped using the reshape method before fitting the model.
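A minimal sketch of the reshaping requirement (with illustrative values): a one-dimensional numpy array must be reshaped into an (observations)-by-(1) matrix before calling fit:

    import numpy as np
    from sklearn import linear_model

    x = np.array([0.0, 1.0, 2.0, 3.0])   # shape (4,): a vector, not a matrix
    y = np.array([1.0, 3.1, 4.9, 7.2])

    X = x.reshape(-1, 1)                 # shape (4, 1): one column per variable
    model = linear_model.LinearRegression()
    model.fit(X, y)                      # fit(x, y) would raise a ValueError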

With regard to the output of the linear regression model, it is important to know the syntax for referring to key quantities. The regression coefficients and the intercept can be referenced using the coef_ and intercept_ attributes. For multiple linear regression, the coef_ attribute will be in the form of a vector. For small datasets using simple linear regression, you should be able to check your intuition by estimating the slope and the intercept and verifying them against the output of the regression. Lastly, the coefficient of determination can be computed using the r2_score function from the sklearn.metrics module.
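Continuing with the same illustrative data, the fitted quantities can be read off as follows:

    import numpy as np
    from sklearn import linear_model
    from sklearn.metrics import r2_score

    X = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
    y = np.array([1.0, 3.1, 4.9, 7.2])

    model = linear_model.LinearRegression().fit(X, y)
    print(model.coef_)        # one slope per independent variable
    print(model.intercept_)   # the fitted intercept
    print(r2_score(y, model.predict(X)))   # coefficient of determination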

To review, see Linear Regression.

 

9c. Apply the scikit-learn module to validate linear regression models 

  • What methods are important for model validation?
  • What is cross-validation?
  • How is cross-validation implemented in practice?

As is customary for many machine learning modules, the predict and score methods can be applied to test data after fitting the model. It is important to note that a linear model can both interpolate within the bounds of the training set and extrapolate outside them. It is, therefore, up to the data scientist to interpret the predictions. For example, if a simple linear regression models the cost of some item, a predicted cost less than zero would be unrealistic, even though the regression will happily produce such values.
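The sketch below (synthetic data) applies predict and score to new inputs; note that the model extrapolates without complaint, so interpreting the output is your job:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X_train = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
    y_train = np.array([2.0, 4.1, 5.9, 8.2])
    model = LinearRegression().fit(X_train, y_train)

    X_test = np.array([2.5, 10.0, -5.0]).reshape(-1, 1)  # last two extrapolate
    print(model.predict(X_test))          # a negative "cost" comes back for -5.0
    print(model.score(X_train, y_train))  # R² on the training data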

Now that various topics on supervised and unsupervised learning have been covered, it is sensible to consider more sophisticated model validation approaches such as cross-validation. In this approach, a test (or "validation") set is extracted from the overall dataset and kept separate from the training set. The test and training data are selected randomly; therefore, several random trials can be performed to arrive at a statistical estimate of model performance. You should be comfortable using Python to construct cross-validation tests.
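One simple way to construct such a test is to repeat a random train/test split several times and average the scores; scikit-learn's cross_val_score performs the k-fold version in a single call. The data below is synthetic:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3.0 * X.ravel() + 1.0 + rng.normal(0, 1.0, size=100)

    scores = []
    for trial in range(10):   # several random splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    print(np.mean(scores))    # statistical estimate of model performance

    # k-fold cross-validation in a single call
    print(cross_val_score(LinearRegression(), X, y, cv=5))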

To review, see Linear Regression and Cross-Validation.

 

9d. Explain data overfitting 

  • What is overfitting?
  • What is underfitting?
  • What steps can be taken to avoid overfitting a model?

After a model has been trained, objective measures must be in place to ensure acceptable performance, and the data scientist must be able to spot when something is awry. Underfitting occurs when a mathematical model cannot adequately capture the underlying structure of the data; this can happen if the data is too complex for the model. Overfitting occurs when a model has too many free parameters relative to the amount of training data. In this case, the model fits the training data too closely and loses its ability to generalize. This means a quantity such as the mean squared training error can be quite low while the model still cannot accurately handle data outside the training set.
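A quick numerical illustration (synthetic data): fitting a degree-9 polynomial to 10 noisy points drives the training error to essentially zero, yet the error on held-out points is far worse, while a degree-1 fit underfits both:

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
    x_test = np.linspace(0.05, 0.95, 50)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, train_mse, test_mse)  # degree 9: tiny train MSE, large test MSE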

To avoid overfitting and underfitting, you can apply cross-validation techniques. For example, it is common to partition training into epochs and to test a validation set at the end of each epoch to ensure the model retains its ability to generalize during the training session. In this way, the model is constantly being tested so that both the training error and the validation error stay small. For simple examples such as the two-class classification problem, you should be able to visualize a well-fitted versus an overfitted decision boundary between the class datasets.
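In that spirit, here is a sketch (synthetic data, with polynomial degree standing in for model complexity) that tracks both the training score and the validation score as complexity grows:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(2)
    X = rng.uniform(0, 1, size=(60, 1))
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, size=60)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

    for degree in range(1, 13):
        poly = PolynomialFeatures(degree)
        model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
        train_score = model.score(poly.transform(X_tr), y_tr)
        val_score = model.score(poly.transform(X_val), y_val)
        print(degree, train_score, val_score)  # watch the gap between the scores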

To review, see Overfitting.

 

Unit 9 Vocabulary 

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • coefficient of determination
  • coef_
  • correlation coefficient
  • cross-validation
  • intercept_
  • linear_model.LinearRegression
  • overfitting
  • r2_score
  • residual
  • underfitting