There is more to data science than simply analyzing data. The ability to build a model from a data set implies that some deeper relationship amongst data observations has been captured. Now that we have covered the basics of data mining, it makes sense to consider statistical models that allow for inference and also have some predictive power.
This unit demonstrates how to apply the scikit-learn module to build regression models and how to interpret model parameters once a model has been constructed. You will learn how to use Python to create regression models, draw inferences, and make predictions from the computed model parameters.
Completing this unit should take you approximately 5 hours.
Now that data mining algorithms and, in particular, supervised learning concepts have been covered, it is time to address the construction of statistical models. The subject of linear regression has been mentioned in passing at several points throughout the course. In this unit, we will delve more deeply into this technique. In its simplest form, the goal is to optimally identify the slope and intercept for empirical data assumed to depend linearly upon some independent variable. Linear regression is a supervised learning technique because training data for the independent variable is mapped to data associated with the dependent variable. Once the linear model is created, it becomes possible to obtain estimates for data not contained within the training set. Ensure you understand the examples and associated calculations in the video, such as residuals, the correlation coefficient, and the coefficient of determination. Additionally, if necessary, you may want to review hypothesis testing and tests for significance introduced in the statistics unit. After this video, you will learn how to implement this technique using scikit-learn. However, as a programming exercise, you should feel confident writing code that implements the regression equations yourself.
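As a starting point for that programming exercise, here is a minimal sketch of the regression equations implemented "by hand" with NumPy. The data values are made up for illustration; the formulas are the standard least-squares estimates for slope and intercept, followed by the residuals and the coefficient of determination:

```python
# Simple linear regression from the least-squares formulas (illustrative data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

# Least-squares estimates for slope and intercept
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals and the coefficient of determination (R^2)
y_pred = intercept + slope * x
residuals = y - y_pred
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

Working through a small example like this by hand makes it easier to interpret what scikit-learn reports later in the unit.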
This tutorial runs through the basics of scikit-learn syntax for linear regression. Pay close attention to how the data is generated for this example. Notice how list comprehension is used to create a list for the dependent and independent variables. Furthermore, the dependent variable list is formed by adding random Gaussian noise to each value in the independent variable list. These lists are then converted to numpy arrays to train the linear regression model. By construction, there exists a linear relationship between the independent and dependent variables. Linear regression is then used to identify the slope and intercept, which should match the empirical data. Finally, predictions of the dependent variable are made using independent variable data not contained in the training set.
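The workflow described above can be sketched as follows. This is not the tutorial's exact code; the specific values and noise level are assumptions chosen for illustration:

```python
# Sketch: list comprehensions build the data, Gaussian noise forms y,
# and the lists are converted to numpy arrays for scikit-learn.
import random
import numpy as np
from sklearn.linear_model import LinearRegression

random.seed(0)
x = [float(i) for i in range(20)]                # independent variable
y = [xi + random.gauss(0, 0.5) for xi in x]      # y = x plus Gaussian noise

# scikit-learn expects a 2-D feature array: one row per sample, one column per feature
X = np.array(x).reshape(-1, 1)
y = np.array(y)

model = LinearRegression()
model.fit(X, y)
print(model.coef_[0], model.intercept_)          # should be close to 1 and 0

# Predict the dependent variable at values not in the training set
print(model.predict(np.array([[25.0], [30.0]])))
```

Because the data was constructed with slope 1 and intercept 0, the fitted parameters should land close to those values, with the noise accounting for the small discrepancy.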
Follow this example for more practice with linear regression implementation. Like the previous example, by construction, data is generated having a linear relationship. However, notice that the data generation technique is quite different from the previous example, as numpy methods are used directly to generate random arrays. In addition, this example multiplies the slope by a small amount of random noise (rather than adding noise to the linear model as is usually assumed in the linear regression derivation).
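A sketch of this alternate data-generation style is below. The specific numbers are assumptions, not the example's own code; the point is that numpy routines generate the random arrays directly and the noise enters multiplicatively:

```python
# Sketch: numpy generates random arrays directly, and the slope is
# perturbed by multiplicative noise rather than additive noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))              # random independent variable
noise = 1 + 0.05 * rng.standard_normal((100, 1))   # multiplicative noise near 1
y = (3.0 * noise * X + 2.0).ravel()                # slope 3, perturbed multiplicatively

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)            # approximately 3 and 2
```

Note that multiplicative noise makes the scatter grow with X, which departs from the constant-variance assumption in the usual linear regression derivation; the fit still recovers the slope and intercept reasonably well here.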
Up to this point, we have discussed linear regression for a single independent variable. Watch this video to see how to extend these ideas to multiple linear regression, which constructs linear models using multiple independent variables.
The LinearRegression method in scikit-learn can handle multiple independent variables to perform multiple linear regression. Follow this tutorial, which combines your knowledge of pandas with scikit-learn.
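A minimal sketch of this combination is below. The DataFrame, column names, and coefficient values are hypothetical; the pattern is passing multiple DataFrame columns as the feature matrix:

```python
# Sketch: multiple linear regression from a pandas DataFrame
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, 50),
    "x2": rng.uniform(0, 10, 50),
})
# A known linear combination plus a little additive noise
df["y"] = 2.0 * df["x1"] - 1.5 * df["x2"] + 4.0 + rng.normal(0, 0.1, 50)

model = LinearRegression()
model.fit(df[["x1", "x2"]], df["y"])     # two independent variables at once
print(model.coef_, model.intercept_)     # close to [2.0, -1.5] and 4.0
```

Each entry of `coef_` is the fitted coefficient for the corresponding column, which is how model parameters are interpreted when there are several independent variables.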
The key to creating any statistical model is to verify that the model actually explains the data. In addition to simple visual inspection, residuals provide a pathway for making a rigorous estimate of model accuracy when applying any form of regression. Read this overview of how residuals are applied.
Use this video to tie up any conceptual loose ends and see more examples of how residuals can help evaluate model accuracy.
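As a quick computational companion to these readings, here is a brief sketch (with made-up data) of computing residuals from a fitted model. For a well-specified linear model, the residuals should be centered on zero with no visible pattern:

```python
# Sketch: residuals are observed values minus fitted values
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.3, 40)   # linear signal plus noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print(residuals.mean())    # essentially zero for least squares with an intercept
print(residuals.std())     # roughly the size of the noise (0.3 here)
```

Plotting `residuals` against `X` is the standard diagnostic: a random, patternless scatter supports the linear model, while curvature or funneling suggests the model is misspecified.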
Work through this project to put together the concepts introduced so far. To navigate to the notebook portion of the project, click on the SLRProject.ipynb link near the top of the page to run it in Google Colab. Assuming the following command has been invoked:
import pandas as pd
you can run the commands:
data = pd.read_csv("http://www.econometrics.com/intro/SALES.txt")
data.head()
to access the dataset and see the first few lines printed to the screen.
Always be suspicious of a perfect fit for your data for machine learning problems. A model that fits a training set well but gives poor testing results is said to overfit the training data. This caution is reserved for any learning model. We introduce it here as a means of connecting concepts together with the data mining units. Read the following article for an overview of overfitting.
Follow this practice example to see how overfitting can occur within a learning model.
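The phenomenon can be reproduced in a few lines. This is an illustrative sketch with assumed data, not the practice example's code: a high-degree polynomial chases the noise in a small training set, while a simple linear fit captures the underlying trend:

```python
# Sketch: an overly flexible model fits the training set better
# but generalizes worse than a simple linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(15, 1))
y_train = 2.0 * X_train.ravel() + rng.normal(0, 1.0, 15)   # linear truth + noise
X_test = rng.uniform(0, 10, size=(15, 1))
y_test = 2.0 * X_test.ravel() + rng.normal(0, 1.0, 15)

linear = LinearRegression().fit(X_train, y_train)
wiggly = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
wiggly.fit(X_train, y_train)

# The flexible model always scores at least as well on the training data...
print(r2_score(y_train, wiggly.predict(X_train)))
print(r2_score(y_train, linear.predict(X_train)))
# ...but that near-perfect training fit does not carry over to new data
print(r2_score(y_test, wiggly.predict(X_test)))
print(r2_score(y_test, linear.predict(X_test)))
```

The warning sign is the gap between training and test performance, which is exactly what cross-validation, introduced next, is designed to measure systematically.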
Read through this article for a brief visual summary of cross-validation.
Cross-validation is a technique for validating learning models. Up until this point in the course, model evaluations have only been applied using a single test (usually by splitting up a data set into a training set and a test set). In practice, a statistical distribution of test results must be constructed. Only then can confidence intervals be applied to the resulting distribution. Read through this article to understand cross-validation.
Work through this programming example in order to implement a cross-validation scheme on a scikit-learn data set you have seen in the previous units.
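The core of such a scheme fits in a few lines. The sketch below uses scikit-learn's built-in diabetes dataset as a stand-in (your example's dataset may differ): `cross_val_score` fits and scores the model on each of k train/test splits, yielding a distribution of scores rather than a single number:

```python
# Sketch: 5-fold cross-validation of a linear regression model
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y, cv=5)  # R^2 score per fold

print(scores)                        # five scores, one per held-out fold
print(scores.mean(), scores.std())   # summarize the distribution
```

Reporting both the mean and the spread of the fold scores is the point of the technique: the spread indicates how much the model's performance depends on the particular split.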
Use this project as a culminating exercise to implement the concepts presented in this unit. This exercise will show you how to obtain a data set, create the model, examine residuals, visualize results, validate the model, and apply it.
Take this assessment to see how well you understood this unit.