9.1: Linear Regression
Now that data mining algorithms and, in particular, supervised learning concepts have been covered, it is time to address the construction of statistical models. Linear regression has been mentioned only in passing at several points throughout the course. In this unit, we will delve more deeply into the technique. In its simplest form, the goal is to find the slope and intercept that best fit empirical data assumed to depend linearly on a single independent variable. Linear regression is a statistical supervised learning technique because the training data maps values of the independent variable to observed values of the dependent variable. Once the linear model has been fit, it can produce estimates for inputs not contained in the training set. Ensure you understand the examples and associated calculations in the video, such as residuals, the correlation coefficient, and the coefficient of determination. If necessary, you may also want to review hypothesis testing and tests for significance introduced in the statistics unit. After this video, you will learn how to implement this technique using scikit-learn; however, as a programming exercise, you should also feel confident writing code that implements the regression equations directly.
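As a check on your understanding of the regression equations, here is a minimal sketch of how the slope, intercept, residuals, correlation coefficient, and coefficient of determination could be computed directly with numpy. The sample data is hypothetical and chosen only for illustration; it is not taken from the video.

```python
import numpy as np

# Hypothetical sample data (for illustration only): x is the independent
# variable, y is the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates of the slope and intercept
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: observed values minus fitted values
y_hat = intercept + slope * x
residuals = y - y_hat

# Correlation coefficient r and coefficient of determination R^2
r = np.sum((x - x_mean) * (y - y_mean)) / np.sqrt(
    np.sum((x - x_mean) ** 2) * np.sum((y - y_mean) ** 2)
)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"r={r:.3f}, R^2={r_squared:.3f}")
```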
This tutorial runs through the basics of scikit-learn syntax for linear regression. Pay close attention to how the data is generated for this example. Notice how a list comprehension is used to create lists for the independent and dependent variables, with the dependent variable formed by adding random Gaussian noise to each value of the independent variable. These lists are then converted to numpy arrays to train the linear regression model. By construction, there is a linear relationship between the independent and dependent variables, so the slope and intercept identified by linear regression should closely match the values used to generate the data. Finally, predictions of the dependent variable are made using independent variable data not contained in the training set.
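The following is a rough sketch of that workflow, assuming a true slope, intercept, and noise level chosen here for illustration; the tutorial's exact values and variable names may differ.

```python
import random
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate data with a linear relationship (constants are illustrative).
true_slope, true_intercept = 2.0, 1.0
x_list = [i for i in range(100)]
# Dependent variable: linear model plus random Gaussian noise
y_list = [true_slope * x + true_intercept + random.gauss(0, 5) for x in x_list]

# Convert the lists to numpy arrays; scikit-learn expects a 2-D feature array
X = np.array(x_list).reshape(-1, 1)
y = np.array(y_list)

model = LinearRegression()
model.fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predict using independent variable values not in the training set
X_new = np.array([150, 200]).reshape(-1, 1)
print("predictions:", model.predict(X_new))
```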
Follow this example for more practice with linear regression implementation. As in the previous example, the data is constructed to have a linear relationship. However, the data generation technique is quite different: numpy methods are used directly to produce random arrays, and the slope term is multiplied by a small amount of random noise (rather than adding noise to the linear model, as is usually assumed in the derivation of linear regression).
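A minimal sketch of that alternative data-generation approach is shown below; the constants and distribution parameters are assumptions for illustration, not taken from the example itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Independent variable generated directly as a numpy array
x = rng.uniform(0, 10, size=200)

# Multiplicative noise on the slope term instead of additive noise on y
slope, intercept = 3.0, 2.0
noise = rng.normal(1.0, 0.05, size=x.shape)  # centered at 1 with small spread
y = slope * noise * x + intercept

model = LinearRegression().fit(x.reshape(-1, 1), y)
print("estimated slope:", model.coef_[0])
print("estimated intercept:", model.intercept_)
```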
Up to this point, we have discussed linear regression for a single independent variable. Watch this video to see how to extend these ideas to multiple linear regression, which constructs linear models using multiple independent variables.
The LinearRegression method in scikit-learn can accept multiple independent variables, allowing it to perform multiple linear regression. Follow this tutorial, which combines your knowledge of pandas with scikit-learn.
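A short sketch of how such a pandas/scikit-learn workflow might look is given below; the column names and data are hypothetical and not those used in the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical DataFrame with two independent variables and one dependent variable
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, 100),
    "x2": rng.uniform(0, 5, 100),
})
df["y"] = 1.5 * df["x1"] - 2.0 * df["x2"] + 4.0 + rng.normal(0, 1, 100)

# Multiple linear regression: pass a DataFrame of feature columns directly
model = LinearRegression()
model.fit(df[["x1", "x2"]], df["y"])

print("coefficients:", model.coef_)   # one coefficient per independent variable
print("intercept:", model.intercept_)
```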