9.1: Linear Regression
Now that data mining algorithms and, in particular, supervised learning concepts have been covered, it is time to address the construction of statistical models. Linear regression has been mentioned only in passing at several points throughout the course. In this unit, we will delve more deeply into the technique. In its simplest form, the goal is to find the slope and intercept that best fit empirical data assumed to depend linearly on a single independent variable. Linear regression is a statistical supervised learning technique because the training data maps values of the independent variable to observed values of the dependent variable. Once the linear model has been fit, it can produce estimates for inputs not contained in the training set. Ensure you understand the examples and associated calculations in the video, such as residuals, the correlation coefficient, and the coefficient of determination. If necessary, you may also want to review hypothesis testing and tests for significance introduced in the statistics unit. After this video, you will learn how to implement this technique using scikit-learn; however, as a programming exercise, you should also feel confident writing code that implements the regression equations directly.
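As a check on your understanding of the regression equations, here is a minimal sketch of how the slope, intercept, residuals, correlation coefficient, and coefficient of determination could be computed directly with numpy. The sample data is hypothetical and chosen only for illustration; it is not taken from the video.

```python
import numpy as np

# Hypothetical sample data (for illustration only): x is the independent
# variable, y is the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates of the slope and intercept
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: observed values minus fitted values
y_hat = intercept + slope * x
residuals = y - y_hat

# Correlation coefficient r and coefficient of determination R^2
r = np.sum((x - x_mean) * (y - y_mean)) / np.sqrt(
    np.sum((x - x_mean) ** 2) * np.sum((y - y_mean) ** 2)
)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"r={r:.3f}, R^2={r_squared:.3f}")
```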
This tutorial runs through the basics of scikit-learn syntax for linear regression. Pay close attention to how the data is generated for this example. Notice how a list comprehension is used to create lists for the independent and dependent variables, with the dependent variable formed by adding random Gaussian noise to each value of the independent variable. These lists are then converted to numpy arrays to train the linear regression model. By construction, there is a linear relationship between the independent and dependent variables, so the slope and intercept identified by linear regression should closely match the values used to generate the data. Finally, predictions of the dependent variable are made using independent variable data not contained in the training set.
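The following is a rough sketch of that workflow, assuming a true slope, intercept, and noise level chosen here for illustration; the tutorial's exact values and variable names may differ.

```python
import random
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate data with a linear relationship (constants are illustrative).
true_slope, true_intercept = 2.0, 1.0
x_list = [i for i in range(100)]
# Dependent variable: linear model plus random Gaussian noise
y_list = [true_slope * x + true_intercept + random.gauss(0, 5) for x in x_list]

# Convert the lists to numpy arrays; scikit-learn expects a 2-D feature array
X = np.array(x_list).reshape(-1, 1)
y = np.array(y_list)

model = LinearRegression()
model.fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predict using independent variable values not in the training set
X_new = np.array([150, 200]).reshape(-1, 1)
print("predictions:", model.predict(X_new))
```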
Follow this example for more practice with linear regression implementation. As in the previous example, the data is constructed to have a linear relationship. However, the data generation technique is quite different: numpy methods are used directly to produce random arrays, and the slope term is multiplied by a small amount of random noise (rather than adding noise to the linear model, as is usually assumed in the derivation of linear regression).
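A minimal sketch of that alternative data-generation approach is shown below; the constants and distribution parameters are assumptions for illustration, not taken from the example itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Independent variable generated directly as a numpy array
x = rng.uniform(0, 10, size=200)

# Multiplicative noise on the slope term instead of additive noise on y
slope, intercept = 3.0, 2.0
noise = rng.normal(1.0, 0.05, size=x.shape)  # centered at 1 with small spread
y = slope * noise * x + intercept

model = LinearRegression().fit(x.reshape(-1, 1), y)
print("estimated slope:", model.coef_[0])
print("estimated intercept:", model.intercept_)
```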
Up to this point, we have discussed linear regression for a single independent variable. Watch this video to see how to extend these ideas to multiple linear regression, which constructs linear models using multiple independent variables.
The LinearRegression method in scikit-learn can accept multiple independent variables, allowing it to perform multiple linear regression. Follow this tutorial, which combines your knowledge of pandas with scikit-learn.
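A short sketch of how such a pandas/scikit-learn workflow might look is given below; the column names and data are hypothetical and not those used in the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical DataFrame with two independent variables and one dependent variable
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, 100),
    "x2": rng.uniform(0, 5, 100),
})
df["y"] = 1.5 * df["x1"] - 2.0 * df["x2"] + 4.0 + rng.normal(0, 1, 100)

# Multiple linear regression: pass a DataFrame of feature columns directly
model = LinearRegression()
model.fit(df[["x1", "x2"]], df["y"])

print("coefficients:", model.coef_)   # one coefficient per independent variable
print("intercept:", model.intercept_)
```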