CS250 Study Guide

Site: Saylor Academy
Course: CS250: Python for Data Science
Book: CS250 Study Guide

Navigating this Study Guide

Study Guide Structure

In this study guide, the sections in each unit (1a., 1b., etc.) are the learning outcomes of that unit. 

Beneath each learning outcome are:

  • questions for you to answer independently;
  • a brief summary of the learning outcome topic; and
  • resources related to the learning outcome.

At the end of each unit, there is also a list of suggested vocabulary words.

 

How to Use this Study Guide

  1. Review the entire course by reading the learning outcome summaries and suggested resources.
  2. Test your understanding of the course information by answering questions related to each unit learning outcome and defining and memorizing the vocabulary words at the end of each unit.

By clicking on the gear button on the top right of the screen, you can print the study guide. Then you can make notes, highlight, and underline as you work.

Through reviewing and completing the study guide, you should gain a deeper understanding of each learning outcome in the course and be better prepared for the final exam!

Unit 1: What is Data Science?

1a. Explain what data science is

  • What is data science?
  • What disciplines are associated with data science?
  • What types of data does data science deal with?
  • What is data engineering?

Data science is the field of collecting, handling, and analyzing data in order to extract knowledge from it. Although its roots reach further back, the term "data science" was coined in 2001. As the ability to store and operate on large databases increased, it became clear that a convergence of many different disciplines was required to draw conclusions from large, possibly distributed datasets. Hence, data science requires overlapping expertise in methods drawn from computer science, mathematics, and statistics. At its core, it is directly related to the scientific method, as it describes the process of formulating a hypothesis, acquiring the necessary data, performing data analysis, and, finally, drawing conclusions or making predictions. Applying the scientific method requires logical reasoning, which can be divided into two broad categories: deductive and inductive. For example, conclusions may be drawn from a hypothesis test that attempts to quantify the deviation from a null hypothesis, a proposition that reflects current understanding.

It is important to be aware of the many types of data, such as video, audio, images, text, numbers, and so on. When embarking upon a data science project, data may be converted or transformed from its raw form into a different form that lends itself to a specific analysis technique. Data engineering is the aspect of data science that deals specifically with collecting, curating, storing, and retrieving data. In some sense, data engineering is the initial point from which all other analyses will follow.

To review, see A History of Data Science.

 

1b. Explain data analysis and data modeling methodologies

  • What are the essential aspects of the data science life cycle?
  • What is a data science model?
  • What basic modeling and analysis methodologies should every data scientist know about?

The data science life cycle emphasizes the reality that, during data analysis and modeling, there is rarely a perfectly straight line between input data and output results. Since conclusions are not known a priori, it is often necessary to take initial results, refine them, and then reassess the analysis or modeling methodology. Along the data life cycle journey, it may become necessary to build a model; hence, there are basic approaches that every data scientist needs to be aware of. A model is a representation of a system that takes input data and generates outputs consistent with what is expected from the dataset(s) under consideration. Data models that are statistical in their approach are usually constructed from samples taken from a population. In this class of models, it is important to understand concepts such as target population, access frame, and sampling. A major reason for constructing this class of models is to reduce statistical bias, the difference between what your model predicts and what is actually measured in reality.

Linear models are another important class of models that can either be statistical (such as regression) or deterministic (such as the method of least squares), but the main goal is to identify the best straight line that explains the data. Probabilistic models can be useful for generating and analyzing random data sets. A good example is the urn model, which analyzes the process of drawing indistinguishable marbles from an urn.

To review, see The Data Science Lifecycle.

 

1c. Explain techniques for approaching data science

  • How do various disciplines view data science problems?
  • Why is visualization important to data science?
  • How do techniques from optimization theory play a role in data science?

As data science involves the intersection of many disciplines, it is important to understand how individuals from these disciplines view data science problems. For example, mathematicians might view a problem in terms of theorems and equations; on the other hand, computer scientists might think of numerical methods and algorithms. Visualization and data rendering in data science are essential because a picture truly can be worth a thousand words. Rendering data in two or three dimensions (using, for example, a heatmap) can often reveal correlations within a dataset via immediate visual inspection.

Statisticians often think in terms of sampling, where statistical tests are a function of the data available. For example, probability sampling is a general term for selecting a sample from a population. Simple random sampling is a type of probability sampling where a subset of participants is chosen randomly from a population. For larger populations, cluster sampling is a type of probability sampling where a population is divided into smaller groups or "clusters". A sample is then formed by randomly selecting from among the clusters. Data scientists must know how to sample a population for proper experimental design.

Optimization theory is a field necessary for implementing data science techniques. Although this fact is not always pointed out at the introductory level, data science results often come as the output of some objective optimization criteria. For example, when finding critical points in a function such as maxima or minima, points where a slope equals zero are identified. Additionally, objective measures such as mean squared error and loss functions are regularly applied in machine learning techniques to measure the performance of a model.

To review, see A History of Data Science, The Data Science Lifecycle, and Thinking about the World.

 

Unit 1 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam. 

  • cluster sampling
  • data engineering
  • data science
  • hypothesis testing
  • indistinguishable
  • logical reasoning
  • loss function
  • model
  • random sampling
  • sample
  • scientific method
  • slope
  • statistical bias
  • visualization

Unit 2: Python for Data Science

2a. Create a Python notebook using Google Colab 

  • What is Google Colab?
  • Why are Python notebooks useful?
  • How can Python modules be installed in Google Colab?
  • Where are Google Colab notebooks stored?

Python notebooks are extremely useful for testing blocks of Python code and annotating the notebook with text cells. You can easily share them with other users, who can then experiment with the Python code contained within the notebook. Google Colaboratory (Colab) is an online Python notebook environment that, in most cases, bypasses the need to install the modules commonly used by the Python community because many come preinstalled. If it becomes necessary to install a module, the command

!pip install module_name

can be used to invoke the installation.

Using Google as the online service requires an awareness of where and how notebooks are stored. You must be familiar with your Google Drive. Colab will create a folder for your notebooks as the default destination directory on your Google Drive. Notebooks can also be downloaded from Colab so that they can be stored or used locally on a platform such as Jupyter Notebook. Within the Google Colab environment, you must understand how to create and delete code cells. In addition, there are several options for running code cells depending on how you have arranged your code. Navigating these various options must be part of your Google Colab skill set.

To review, see Introduction to Google Colab.

 

2b. Execute instructions using built-in Python data and control structures 

  • What are some important built-in Python data types?
  • What are some important built-in Python data structures?
  • What kind of loop structures does the Python language support?
  • Why is operator precedence important for if-else-elif statements?

This course depends upon having basic operating knowledge of the Python programming language. Python supports basic data types such as integers (int), floating-point numbers (float), strings (str), and booleans (bool). Lists and tuples are ordered containers, meaning their elements can be referred to using an index. Lists are mutable objects, meaning you can modify their elements. Tuples are immutable objects, meaning their elements, once initialized, cannot be modified. Sets and dictionaries are unordered containers, but dictionaries contain "key:value" pairs where values can be referenced by their keys.

It is important to understand relational, boolean, and arithmetic operators. Often, when applying if-elif-else control structures, complex boolean expressions are necessary. Constructing these expressions requires understanding the precedence of relational operators (==, !=, >, <, >=, <=) over boolean logical operators (and, or, not) and the precedence of and over or.
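
For example, the following minimal sketch (with arbitrary variable names and values) shows how these precedence rules determine which branch runs:

# Relational operators bind tighter than boolean operators, and "and" binds tighter than "or".
x, y = 7, 3
if x > 5 and y < 2 or y == 3:
    # Evaluated as ((x > 5) and (y < 2)) or (y == 3), which is False or True, so True
    print("branch 1")
elif not x == 7:
    print("branch 2")
else:
    print("branch 3")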

Although this course does not emphasize object-oriented programming and class design, you should be well versed in the syntax for defining functions (and methods) using the def keyword, calling them, and accessing class data attributes. Finally, Python supports "for" loops and "while" loops; therefore, familiarity with iteration using these loop structures is a necessary part of your Python programming capacity.
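
The following minimal sketch (the function name and values are arbitrary) reviews defining a function with def and iterating with for and while loops:

# Define a function that sums the squares of a list of numbers.
def sum_of_squares(values):
    total = 0
    for v in values:          # a "for" loop iterates over the list elements
        total += v ** 2
    return total

print(sum_of_squares([1, 2, 3]))   # prints 14

# The same computation using a "while" loop.
values, total, i = [1, 2, 3], 0, 0
while i < len(values):
    total += values[i] ** 2
    i += 1
print(total)                       # prints 14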

To review, see Data Types in Python, Functions, Loops, and Logic, Data Structures in Python, and Sets, Tuples, and Dictionaries.

 

2c. Apply methods for random numbers within the random module 

  • What is the random module?
  • What are some useful methods contained in the random module?
  • How can the random module be applied?

Random number generation is a significant facet of computer science. Simulations involving random events (such as data communications, sunspot activity, and the weather), as well as many data science applications, often require some form of random number generation (RNG). Several ways of approaching the issue of generating random numbers are introduced within this course, but the starting point is the random module.

While there are many methods contained within the random module, there are some that you must be familiar with. The seed method allows you to set the random seed so that the RNG can be set to the same starting point. Data scientists must understand two of the most basic probability distributions: the uniform distribution and the normal distribution. The random method generates numbers from a uniform distribution within the interval [0.0, 1.0), and the uniform method generates numbers from a uniform distribution within the interval [a, b]. The randint method uniformly generates integers in the interval [a, b]. The gauss method generates numbers from a normal distribution with the mean and standard deviation values as input parameters. The setstate and getstate methods allow you to either set the state of the RNG or read the state so that it can be saved for later use. These methods are the minimal set needed to get up and running with basic simulations in applied statistics.
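
A minimal sketch exercising these methods (the seed and parameter values are arbitrary):

import random

random.seed(42)                  # set the seed for reproducible results
state = random.getstate()        # save the current RNG state

print(random.random())           # uniform float in [0.0, 1.0)
print(random.uniform(2, 5))      # uniform float in [2, 5]
print(random.randint(1, 6))      # uniform integer in [1, 6], like a die roll
print(random.gauss(0, 1))        # normal draw with mean 0 and standard deviation 1

random.setstate(state)           # restore the saved state
print(random.random())           # repeats the first draw above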

To review, see Python's random Module.

 

2d. Implement basic plotting and data rendering instructions using the matplotlib module 

  • What is matplotlib?
  • How can matplotlib be applied?
  • What are some important methods in the matplotlib module?

A critical component of data science is visualization. This course introduces several modules for this purpose, but matplotlib is the starting point. Specifically, the pyplot portion of matplotlib is emphasized within this module to exercise a set of introductory commands for getting a plot up and running. Furthermore, since numpy has not yet been introduced, lists are used as the data structure for creating two-dimensional plots; therefore, you must be clear on the syntax for plotting list data.

While there are many methods contained within the matplotlib.pyplot module, there are some that you must be familiar with. Line plots using the plot method and scatter plots using the scatter method are fundamental. You should also know how to choose colors and plot markers such as dashed lines. Annotating plots requires methods like title, xlabel, ylabel, grid, and legend. These methods are the minimal set to get up and running to begin the journey of data visualization.
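
A minimal sketch of these plotting and annotation methods using list data (the data values are arbitrary):

import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

plt.plot(x, y, 'r--', label='line plot')              # red dashed line
plt.scatter(x, y, color='blue', label='scatter plot')
plt.title('A simple plot')
plt.xlabel('x values')
plt.ylabel('y values')
plt.grid(True)
plt.legend()
plt.show()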

To review, see Precision Data Plotting with matplotlib.

 

Unit 2 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • boolean
  • def
  • dictionary
  • floating point
  • for loop
  • function
  • gauss
  • getstate
  • if-elif-else
  • integer
  • list
  • logical operator
  • matplotlib
  • normal distribution
  • random
  • relational operator
  • set
  • setstate
  • string
  • tuple
  • uniform
  • while loop

Unit 3: The numpy Module

3a. Implement instructions to create numpy arrays 

  • What is numpy?
  • How is numpy used?
  • What is the syntax for indexing an element within a numpy array?
  • What are some typical instructions applied when using numpy?

The numpy module is fundamental for applying numerical and linear algebra techniques involving arrays of data. Some of the most used modules for data science and machine learning, such as scipy, scikit-learn, pandas, and statsmodels, use numpy as the core for their class constructions. Mathematical entities from linear algebra, such as vectors and matrices, can be represented using numpy arrays. In computer science, an array is simply a data structure for housing elements of the same data type. In the case of numpy arrays, since the intent is usually a numerical one, the data type of array elements is most often int, float, or bool (although string arrays are possible).
 
In numpy, a vector is represented using a one-dimensional array (where only one index is used to refer to array elements). A matrix is represented using a two-dimensional array (with two indices), and so on for higher dimensions. The syntax for applying indices to multi-index arrays is the same as that of nested lists; however, a more convenient form is usually used to minimize the use of square brackets. For example, assume a four-dimensional array is initialized as a = np.random.normal(size=(3,4,2,5)). It is then equally valid to index an array element using either a[2][3][0][4] or a[2,3,0,4].

Basic operations for applying numpy include the array method for creating an array, the shape attribute for determining the shape of an array, the sum method for summing the values contained within an array, and the max and min methods for computing the maximum and minimum elements. The ones method can be used to create an array of ones; the zeros method will create an array of zeros. The eye method can be used to create an identity matrix. A sequence of numbers can be generated in an array using the arange method, and regularly spaced sequences of numbers between particular values can be generated using the linspace method. It is important to understand the difference between these two methods. Finally, you should have a basic understanding of generating an array of random integers using the randint method from the random class within numpy.
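
A minimal sketch of these array-creation operations (the shapes and values are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])          # 2x3 array built from nested lists
print(a.shape, a.sum(), a.max(), a.min())

print(np.ones((2, 3)))                        # 2x3 array of ones
print(np.zeros(4))                            # length-4 vector of zeros
print(np.eye(3))                              # 3x3 identity matrix
print(np.arange(0, 10, 2))                    # [0 2 4 6 8]: start, stop (excluded), step
print(np.linspace(0, 1, 5))                   # 5 evenly spaced values from 0 to 1 inclusive
print(np.random.randint(1, 7, size=(2, 4)))   # 2x4 array of integers in [1, 6]

b = np.random.normal(size=(3, 4, 2, 5))       # the four-dimensional example from above
print(b[2, 3, 0, 4] == b[2][3][0][4])         # the two indexing forms agree: True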

To review, see numpy for Numerical and Scientific Computing.

 

3b. Execute instructions to index arrays using slicing 

  • What is slicing?
  • What is the syntax for slicing in numpy?
  • What are some typical shortcuts when applying slicing?

Slicing is an indexing technique for extracting several contiguous elements at once from an ordered container, such as a list or a tuple. The syntax for slicing extends to numpy arrays with multiple indices. Slicing a numpy vector (that is, a one-dimensional array) works the same way as slicing a Python list. Three values must be specified or implied: the start index, the stop index, and the step, b[start:stop:step], where b is a numpy vector. Slicing a multidimensional array means that this syntax can be applied to any index position.

When the start is omitted, an index of 0 is assumed. When the stop is omitted, the slice runs through the last element. When the step is omitted, a step of one is assumed. You should also be comfortable with using negative indices or a negative step. For example, d = c[::-1] forms a new vector containing all the elements of c in reverse order: because the step is negative, the omitted start defaults to the last index, and the omitted stop means the slice runs back through index 0. Finally, always remember that the stop index itself is excluded; the last element included in a slice has an index one less than the stop. For example, consider a 2-d array a with at least 2 rows and 6 columns. A command such as b = a[0:1,4:5] will slice out a 1×1 array that is inherently two-dimensional (that is, two indices are required to refer to the element). You should test the similarities and differences between b, b[0], and b[0][0] to internalize this example.
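
A minimal sketch of these slicing rules:

import numpy as np

c = np.arange(10)                 # [0 1 2 3 4 5 6 7 8 9]
print(c[2:8:2])                   # [2 4 6]: start 2, stop 8 (excluded), step 2
print(c[:5])                      # [0 1 2 3 4]: omitted start defaults to 0
print(c[::-1])                    # all elements in reverse order (negative step)

a = np.arange(12).reshape(2, 6)   # 2 rows, 6 columns
b = a[0:1, 4:5]                   # a 1x1 array that is still two-dimensional
print(b, b.shape)                 # [[4]] (1, 1)
print(b[0], b[0][0])              # [4] and 4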

To review, see Advanced Indexing with numpy.

 

3c. Demonstrate computation and visualization using array operations 

  • What is vectorized programming?
  • What is broadcasting?
  • How does matplotlib work with numpy arrays?

The numpy module has been designed to work with vectors and matrices. Most programmers start out thinking in scalar terms. For example, the random module is designed to generate scalar random numbers, and, using only Python built-in data structures, we could use a loop to fill a list to mimic a vector of random numbers. The numpy module, in contrast, is designed for vectorized programming, where single commands generate and operate on an entire vector or matrix of data.

To correctly apply the vectorized methodology in numpy, it is important to understand how broadcasting is used to accomplish "element-wise" computation for arithmetic operators such as +, -, *, and /. For example, to perform * between a 3×1 matrix and a 3×5 matrix, the 3×1 matrix will be broadcast 5 times along the column dimension and will be multiplied elementwise by each column in the 3×5 matrix. Notice that element-wise multiplication using * is different from matrix multiplication in the linear algebraic sense. To accomplish matrix multiplication, either the @ operator or the dot method must be applied to the numpy arrays.
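
A minimal sketch of the 3×1 versus 3×5 example above, contrasting element-wise broadcasting with matrix multiplication:

import numpy as np

A = np.ones((3, 5))                    # 3x5 matrix of ones
v = np.array([[1.0], [2.0], [3.0]])    # 3x1 matrix (column vector)

print(v * A)                           # broadcasting: v is stretched across the 5 columns -> 3x5
print(v.T @ A)                         # matrix multiplication: 1x3 times 3x5 -> 1x5
print(np.dot(v.T, A))                  # equivalent to the @ operator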

When it comes to plotting data, the matplotlib module has been designed to work with numpy data. The matplotlib module capacity to plot Python list data is a special case made available for convenience. The typical use case and syntax for matplotlib plotting methods are meant for numpy arrays.

To review, see numpy Arrays and Vectorized Programming, Mathematical Operations with numpy, and numpy with matplotlib.

 

3d. Explain instructions to load and save data using numpy file formats 

  • How is numpy file handling similar to basic Python file handling?
  • How is numpy file handling different from basic Python file handling?
  • What is a .npy file?
  • What is a .npz file?

Understanding how to save and load numpy array data in machine learning applications is important. In such cases, arduous and time-consuming computations can lead to large parameter matrices that must be stored for later use. The simplest solution is to apply the loadtxt and savetxt methods, which load and save numpy data in text format. It is also possible to read and write numpy data in text format using basic Python file handling methods such as write, read, or readline. In fact, it is possible to read data using the numpy method loadtxt from a text file generated using the write method.

Text files can be large compared to their numerical or "binary" counterparts. Hence, the binary format is the preferred method for storing and loading numpy data. Furthermore, it is possible to compress the data to save storage space. The ability to store compressed array data and multiple arrays differs from typical Python file handling. The .npy extension is used for the standard binary file format for saving a single numpy array. The .npz extension is used for the standard binary file format for saving multiple arrays to a single file. Therefore, the numpy save method can be used to save a single array. The numpy savez method can be used to save multiple arrays, and the numpy savez_compressed method can be used to save multiple arrays in compressed format. The numpy load method then can be used to load either .npy or .npz formatted files.
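
A minimal sketch of these save and load methods (the filenames are arbitrary):

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0, 1, 4)

np.savetxt('a.txt', a)                                     # text format
print(np.loadtxt('a.txt'))

np.save('a.npy', a)                                        # single array, binary .npy format
print(np.load('a.npy'))

np.savez('both.npz', first=a, second=b)                    # multiple arrays in one .npz file
data = np.load('both.npz')
print(data['first'], data['second'])

np.savez_compressed('both_small.npz', first=a, second=b)   # compressed .npz variant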

To review, see Storing Data in Files and ".npy" versus ".npz" Files.

 

Unit 3 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • arange
  • array
  • broadcasting
  • index
  • linspace
  • .npy
  • .npz
  • ones
  • randint
  • slicing
  • vectorized programming
  • zeros

Unit 4: Applied Statistics in Python

4a. Apply methods for random numbers within the numpy module

  • How does vectorized programming work with numpy random number generation?
  • What are some similarities between the Python random module and the numpy random class?
  • What are some important numpy random class methods?

The key to mastering programming using numpy is to migrate from scalar thinking to vectorized thinking. Whereas methods from the random module generate a single random number, the numpy.random class is designed to generate vectors and arrays of random numbers. Both contain methods to generate random numbers from basic distributions such as uniform, normal, lognormal, exponential, beta, and gamma. However, the numpy.random class is equipped with a much larger set of distributions and is designed for vectorized programming. Therefore, you should be familiar with computing important quantities such as the sum, max, min, and mean using the axis parameter.

The set of random number generators for numpy.random is quite large. As a data scientist, you may not have an immediate need for all of them, but there are some distributions that you should be aware of (such as uniform, normal, lognormal, logistic, Poisson, and binomial). This means not only understanding what the methods compute but also being highly familiar with their input parameters, their syntax, and the order in which the parameters appear in a method call. For example, randint will uniformly generate random integers with the input parameters low and high. Additionally, the array dimensions can be specified with the size parameter as a tuple. The normal method will generate array data from a normal distribution where the mean and standard deviation can be specified. The main goal is to connect your understanding of basic probability distributions with the necessary input parameters. For instance, the poisson method allows for the input parameter lam. Other methods for random sampling, such as shuffle and choice, should also be reviewed.
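
A minimal sketch of vectorized random number generation and axis-wise summaries (the sizes and parameter values are arbitrary):

import numpy as np

u = np.random.uniform(low=0, high=1, size=(3, 4))     # 3x4 array of uniform draws
n = np.random.normal(loc=5, scale=2, size=(3, 4))     # mean 5, standard deviation 2
k = np.random.randint(low=1, high=7, size=10)         # ten integers in [1, 6]
p = np.random.poisson(lam=3, size=5)                  # Poisson counts with rate lam=3

print(n.mean(axis=0))       # column means (length 4)
print(n.sum(axis=1))        # row sums (length 3)
print(u.max(), u.min())

deck = np.arange(52)
np.random.shuffle(deck)                                 # shuffle in place
print(np.random.choice(deck, size=5, replace=False))    # sample 5 values without replacement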

To review, see Random Number Generation and Using np.random.normal.

 

4b. Apply statistical methods within the scipy.stats module 

  • How is scipy.stats similar to numpy?
  • What are some useful methods for applying the scipy.stats module?
  • What are some important statistical tests that can be implemented using the scipy.stats module?

The scipy.stats module is built upon the numpy module. Both modules can generate random numbers for random simulations from various probability distributions. The scipy.stats module goes a bit further as it can perform a wide range of statistical tests and build statistical models.

Concerning scipy.stats usage, you should be comfortable with methods for generating summary statistics such as mode, tmin, tmax, tmean, tvar, skew, kurtosis, moment, and entropy. You should also recognize the consistency of the syntax amongst various random variables for method calls, such as rvs, mean, std, ppf, pmf (for discrete distributions), and pdf (for continuous distributions). When using a method such as rvs for a given distribution, you must be familiar with the input parameter syntax relevant to the specific distribution. In other words, the rvs method will require different parameters for generating random data using distributions such as chi2, norm, f, binom, and lognorm.

Lastly, scipy.stats can perform a breadth of statistical tests. You should be aware of tests such as the t-test for the mean of one group of scores (ttest_1samp), the t-test for the means of two independent samples of scores (ttest_ind), the Shapiro-Wilk test for normality (shapiro), the one-way chi-square test (chisquare) and skewtest, which tests whether the skewness is different from the normal distribution.
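
A minimal sketch exercising random variates, summary statistics, and two of these tests (the sample size and parameters are arbitrary):

import numpy as np
from scipy import stats

x = stats.norm.rvs(loc=0, scale=1, size=200)     # 200 draws from a standard normal

print(stats.tmean(x), stats.tvar(x))             # trimmed mean and variance (no limits given)
print(stats.skew(x), stats.kurtosis(x))

print(stats.norm.ppf(0.975))                     # about 1.96: the 97.5th percentile of N(0, 1)
print(stats.binom.pmf(k=3, n=10, p=0.5))         # probability of 3 successes in 10 trials

print(stats.ttest_1samp(x, popmean=0))           # is the sample mean consistent with 0?
print(stats.shapiro(x))                          # Shapiro-Wilk test for normality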

To review, see Descriptive Statistics in Python and Statistical Modeling with scipy.

 

4c. Apply the scipy.stats module for solving data science problems 

  • What statistics quantities are important for solving data science problems?
  • How are such quantities computed using the scipy.stats module?
  • What types of problems can be modeled and simulated using the scipy.stats module?

In addition to understanding the syntax for invoking scipy.stats methods, as a data scientist, your goal is to understand how to apply them to data science problems. This means you should be clear about computing quantities such as the Z-score using the zscore method and confidence intervals based upon the empirical mean and standard deviation.

Part of your skill set as a data scientist is knowing how to apply your knowledge of probability distributions to model a given set of data. According to the Central Limit Theorem, problems involving sums of independent random variables can be modeled using the normal distribution. The binomial distribution can be useful for modeling sums of Bernoulli variables (such as a series of coin flips). Stochastic processes involving bursts of an event (such as phone call arrival times) can be modeled using the Poisson distribution. Dice rolls can be modeled using a uniform distribution. Mastering this aspect of data science will enable you to create simulations that can model and explain real data.
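
A minimal sketch connecting these ideas: z-scores, a normal-based confidence interval, and a small Central Limit Theorem style simulation (sample sizes are arbitrary):

from scipy import stats

data = stats.norm.rvs(loc=10, scale=2, size=500)

print(stats.zscore(data)[:5])                    # standardized values for the first five points

# 95% confidence interval for the mean, built from the empirical mean and standard error
mean, sem = data.mean(), stats.sem(data)
print(stats.norm.interval(0.95, loc=mean, scale=sem))

# Sums of many Bernoulli variables (coin flips) are approximately normal
sums = stats.binom.rvs(n=100, p=0.5, size=10000)
print(sums.mean(), sums.std())                   # close to 50 and 5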

To review, see Statistics in Python and Probabilistic and Statistical Risk Modeling.

 

Unit 4 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • Bernoulli distribution
  • binomial distribution
  • Central Limit Theorem
  • chisquare
  • choice
  • confidence interval
  • poisson
  • rvs
  • shapiro
  • shuffle
  • skewtest
  • summary statistics
  • ttest_1samp
  • Z-score

Unit 5: The pandas Module

5a. Explain similarities and differences between dataframes and arrays

  • What is a series?
  • What is a dataframe?
  • How does indexing a pandas dataframe compare with indexing a numpy array?

The pandas module has been built upon the numpy module to handle data like a spreadsheet or a table within a relational database. You can think of a pandas series as holding one-dimensional data. A pandas dataframe can be thought of as a container for two-dimensional data where the column names can be used to refer to data within a row. Generally speaking, the goal of using a numpy one-dimensional array is often similar to that of a series; hence, both data structures are designed to contain homogeneous data.

While the bulk of the course focuses on dataframes, you should feel comfortable creating series using the Series command and creating dataframes using the DataFrame command. You should be aware of the flexibility of using various data structures such as lists, dictionaries, and numpy arrays for initializing a series or a dataframe.

Once you have created a dataframe, the pandas module offers several different ways of referencing and operating on the data. You should be familiar with referencing data using the column names and the index. Additionally, you should understand how slicing works with dataframes. Label-based slicing in pandas works differently from numpy index slicing: the value at the stop label is included. This is critical to understand when applying loc (which extracts elements using the dataframe index and column names with pandas label-based slicing) versus iloc (which extracts elements as if the dataframe were an array, using row and column indices with numpy-style slicing).
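
A minimal sketch contrasting loc and iloc (the dataframe contents are arbitrary):

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40],
                   'b': [1.5, 2.5, 3.5, 4.5],
                   'c': ['w', 'x', 'y', 'z']},
                  index=['r1', 'r2', 'r3', 'r4'])

print(df.loc['r1':'r3', 'a':'b'])   # label-based: rows r1 through r3, columns a and b, stops included
print(df.iloc[0:3, 0:2])            # index-based: rows 0-2 and columns 0-1, stops excluded
print(df['a'])                      # a single column is returned as a pandas Series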

To review, see Dataframes.

 

5b. Apply instructions for cleaning data sets 

  • What is data cleaning?
  • What are some methods helpful in identifying and counting missing values?
  • What are some pandas approaches available for cleaning a data set?

On the practical side of sampling a population and dataset creation, it is possible to end up with a dataframe containing missing values. If such values are not dealt with at the start of a data science endeavor, they could result in spurious calculations with unintended consequences (such as an arithmetic exception). Data cleaning simply means that an approach has been taken to identify and either remove or fill in missing values in a principled way. As a first step towards identifying missing data and non-missing data, methods such as isna and notna should be reviewed. Recall that you can use these methods in tandem with other methods, such as sum, to count the number of missing values along a given axis.

Once missing values have been identified, a couple of options are at your disposal. If the percentage of missing values is small, then the dropna method can help remove them while not injuring the overall statistical content of the data. It is important to know how to apply the axis, how, and inplace input parameters to be very specific about how the missing values are to be removed. If the percentage of missing values is too high, then another option is to fill in those values in a principled manner using the fillna method. With this approach, there are many options to choose from. For example, you can use the method input parameter to perform forward or backward fills. Additionally, you could choose a value based upon, for example, the column mean or median. The solution for the data fill is quite data-dependent and, in reality, may not always result in a positive scientific outcome. On the other hand, as a data scientist, you must be aware of the tools at your disposal to deal with missing data.
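
A minimal sketch of counting and handling missing values (the dataframe is a small artificial example; newer pandas versions prefer df.ffill() over the method parameter):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [np.nan, np.nan, 30.0, 40.0]})

print(df.isna().sum())                # number of missing values in each column
print(df.notna().sum(axis=1))         # number of non-missing values in each row

print(df.dropna(axis=0, how='any'))   # drop rows containing any missing value
print(df.fillna(method='ffill'))      # forward fill (df.ffill() in newer pandas)
print(df.fillna(df.mean()))           # fill each column's missing values with its mean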

To review, see Data Cleaning.

 

5c. Implement operations on dataframes

  • What types of models have dataframes been designed to work with?
  • What types of operations are important for working with dataframes?
  • What are the syntax details for implementing dataframe operations?

The concept of the dataframe exists to encompass spreadsheet and relational database models. Therefore, it is essential to understand how to operate on dataframes within this context. On the spreadsheet side of dataframes, arithmetic operations on (sets of) columns or (sets of) rows need to be mastered. On the database side, one thinks of performing queries and operations such as join and concatenate for merging tables.

You can use the query method for dataframe queries. Although a pandas method implementation exists for any query, phrasing a query using the query method can often simplify the syntax. For example, an isin method call can be reduced to the in operation inside a query string.

When performing arithmetic operations on dataframes, your indexing skills should be able to weather the task of referring to the appropriate columns. You must be aware of the subtleties and default mode for dataframe operations. For example, consider the addition of a series to a dataframe. In this case, you can think of the series as a row vector that will be broadcast and added elementwise to each row in the dataframe. Additionally, if the dataframe has more columns than the length of the series, then the extra columns in the dataframe will end up with NaN (missing) values.

Finally, you should be clear about the subtleties of merging dataframes. By default, the concat method concatenates dataframes along the row dimension. As with database operations, you can implement the join method in various contexts, such as inner, outer, left, and right, using the how input parameter. After performing these join operations, you should know if and how you will fill in missing values. If you understand the inner, outer, left, and right join operations, you can predict how missing values will be dealt with.
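
A minimal sketch of a query, a concatenation, and two join variants (all names and values are artificial):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

print(left.query('x > 1 and key in ["b", "c"]'))     # query strings can use the "in" operation

print(pd.concat([left, right]))                      # concatenates along the row dimension by default

print(pd.merge(left, right, on='key', how='inner'))  # only keys present in both: b and c
print(pd.merge(left, right, on='key', how='outer'))  # all keys; unmatched entries become NaN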

To review, see pandas Operations: Merge, Join, and Concatenate.

 

5d. Write Python instructions for interacting with spreadsheet files 

  • What are some common file formats for handling spreadsheet data?
  • What are some useful pandas methods for reading from and writing to spreadsheet files?
  • What is the syntax for invoking relevant input parameters?

Spreadsheet programs are instrumental for organizing data. Two popular formats for interacting with spreadsheets are Excel format (.xls, .xlsx) and comma-separated values (.csv) format. The pandas module is designed to handle these formats (and others, such as SQL and JSON formats). The pandas read_csv and read_excel methods can read spreadsheet files, and the to_csv and to_excel methods will write to them in the appropriate format. These methods are indispensable for interacting with the outside world. You can use them both for local file storage and accessing files via a URL.

In addition to the syntax for calling spreadsheet methods and specifying filenames, it is also important to understand the syntax for various input parameters. Practical experience with spreadsheets is useful for understanding the utility of multiple sheets. The sheet_name parameter allows you to name the sheet written by the to_excel method (and, together with an ExcelWriter object, to write multiple sheets to one file). Likewise, when applying the read_excel method, the sheet_name parameter is used to select which sheet to read. Finally, when writing to a spreadsheet using the to_excel method, it is also helpful to know how to suppress the dataframe index from being included within the file using the index parameter.
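
A minimal sketch of reading and writing spreadsheet files (the filenames and sheet name are arbitrary, and writing Excel files assumes an engine such as openpyxl is installed):

import pandas as pd

df = pd.DataFrame({'city': ['Rome', 'Lima'], 'population': [2.8, 9.7]})

df.to_csv('cities.csv', index=False)                        # index=False suppresses the dataframe index
print(pd.read_csv('cities.csv'))

df.to_excel('cities.xlsx', sheet_name='demo', index=False)  # sheet_name names the sheet being written
print(pd.read_excel('cities.xlsx', sheet_name='demo'))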

To review, see Data Input and Output.

 

5e. Apply the built-in pandas visualization methods to visualize pandas dataframe data 

  • Why does pandas include visualization methods?
  • What is the syntax for invoking visualization methods in pandas?
  • How is this syntax similar to other visualization modules such as matplotlib?

There is a fair amount of cross-breeding between pandas and other data science modules such as numpy, seaborn, and matplotlib. This is done for the sake of convenience as it is sometimes simpler, from a programming standpoint, to connect commonly used method calls directly to a dataframe.

Like matplotlib, pandas allows for a spectrum of plotting methods such as line plots (line), box plots (box), bar plots (bar), histograms (hist), scatter plots (scatter), and so on. As with other visualization modules such as seaborn, these techniques can also be invoked using the plot method and the kind input parameter. One important distinction is that dataframe plotting methods allow the option to handle missing values by either dropping them or filling them in. Hence, data cleaning can be applied in tandem with plot method calls.
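
A minimal sketch of the built-in dataframe plotting methods (the data values are arbitrary):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': np.arange(50),
                   'y': np.random.normal(size=50).cumsum()})

df.plot(x='x', y='y', kind='line', title='line plot')   # the kind parameter selects the plot type
df.plot(x='x', y='y', kind='scatter')
df['y'].plot(kind='hist', bins=10)
df.plot(kind='box')
plt.show()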

To review, see Visualization Using the pandas Module.

 

Unit 5 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • concat
  • concatenate
  • data cleaning
  • DataFrame
  • dropna
  • fillna
  • iloc
  • isin
  • isna
  • join
  • loc
  • notna
  • query
  • read_csv
  • read_excel
  • Series
  • to_csv
  • to_excel

Unit 6: Visualization

6a. Apply seaborn commands to visualize pandas dataframe data

  • What is the seaborn module, and how is it used?
  • What are some useful plotting categories to be aware of?
  • What are important input parameter choices to be aware of?

The seaborn module is an enhanced visualization package designed for data science applications. It goes beyond the capabilities of matplotlib visualization where, for example, the set_theme method can augment the matplotlib plotting environment. You can use the seaborn module to render data in many ways by creating relational plots, categorical plots, distribution plots, multi-plots, matrix plots, and so on.

Seaborn has conveniently wrapped its plotting routines into specific categories. For example, a relational plot can be expressed using a line plot or a scatter plot. You can call the associated relational plotting methods individually, or they can be configured using the relplot method with input parameters such as kind set to their desired values. Distribution plots are used to plot histograms, marginal distributions, empirical cumulative distributions, and kernel density estimates, either by calling them individually or configuring the displot method. When applying the histplot method, you should be familiar with input parameters such as multiple for plotting multiple histograms and kde for superimposing a kernel density estimate. Category plots are useful for creating bar plots, violin plots, box plots, swarm plots, and count plots, and you should be familiar with the syntax for creating these plots using the catplot method. You should also be familiar with the default mode of these methods.
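
A minimal sketch of the relplot, displot, and catplot wrappers using seaborn's built-in 'tips' dataset (fetching the dataset requires an internet connection the first time it is loaded):

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()
tips = sns.load_dataset('tips')

sns.relplot(data=tips, x='total_bill', y='tip', kind='scatter', hue='time')
sns.displot(data=tips, x='total_bill', kind='hist', kde=True, multiple='stack', hue='sex')
sns.catplot(data=tips, x='day', y='total_bill', kind='box')
plt.show()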

Finally, you should be aware of the subtle differences and applications when creating a pairplot versus using FacetGrid or displot. An input parameter such as margin_titles can be applied along with FacetGrid, which can be useful for visualizing conditional histograms of three-dimensional data in two dimensions. The pairplot method only compares variables two at a time (pairwise across the columns of a dataframe) and does not have a margin_titles option.

To review, see The seaborn Module.

 

6b. Apply advanced data visualization techniques 

  • How is the estimator chosen for statistical visualization techniques?
  • What parameter can be used to specify and visualize the confidence interval?
  • What options are available to stratify and order categorical data?

As you immerse yourself more deeply into the seaborn module, you will find that many parameters are available to fine-tune your visualizations and use more advanced techniques for controlling statistical measures. For example, methods such as barplot allow you to choose the estimator, where the default function is the mean; however, an equally valid choice might be the median. Many seaborn methods allow you to include the confidence interval by setting the ci parameter. Adding this aspect of statistical visualization to line plots or modifying the ci parameter in bar plots can reveal much about a given data set. Additionally, by this point, you should have a strong command of the input parameters for the displot method. You should also be aware that the PairGrid class allows you to control the diagonal scaling using the diag_sharey input parameter to balance the scale of the histogram heights.

The seaborn module offers a great deal of flexibility when it comes to rendering categorical data. Categorical plots will stratify the categories of a specific variable by setting the hue parameter. It is also important that you understand the goal of a box plot versus, for example, a violin plot to visualize quartile ranges, outliers, and kernel density estimates. The use of countplot should also be a part of your inventory for rendering categorical data. You should have a strong command of this method's input parameters. For example, you can control the ordering of the categories using the order parameter. The syntax of plot category parameters is quite consistent among various methods for rendering data.
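
A minimal sketch of these statistical parameters on the 'tips' dataset (note that newer seaborn releases deprecate ci in favor of errorbar, but ci matches the course materials):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

sns.barplot(data=tips, x='day', y='total_bill',
            estimator=np.median, ci=95,              # median estimator instead of the default mean
            order=['Thur', 'Fri', 'Sat', 'Sun'])     # explicit category ordering
sns.countplot(data=tips, x='day', hue='sex')         # counts per category, stratified by hue
sns.violinplot(data=tips, x='day', y='total_bill')   # quartiles plus a kernel density estimate
plt.show()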

To review, see Advanced Data Visualization Techniques.

 

6c. Apply the seaborn module to solve data science problems 

  • What is a matrix plot, and how can it be used to infer correlations with a data set?
  • How are joint plots useful for inferring data classification strategies?
  • How is seaborn used to visualize regressions?

In data science, knowing the syntax is half the battle; the other half is knowing how to apply problem-solving methods. In certain applications, a picture can truly be worth a thousand words. Plots of matrix data using a matrix plot such as heatmap can be extremely useful for visualizing empirical distributions of data. Furthermore, a correlation matrix plotted as a heatmap can be used to immediately infer positive, null, or negative correlations between variable pairs. Using a tool in this manner implies the ability to combine knowledge of statistics with advanced visualization techniques.

The jointplot method is useful for visualizing joint empirical data and will show you the marginal distributions of each variable. If the data is categorical, variable clusters can be further stratified using the hue input parameter. Applying the combination of joint, marginal, and stratified data can give an immediate picture and roadmap for designing a data classification scheme. Finally, you should be aware of the input parameter choices and subtle differences in allowed input data for plotting regressions with either lmplot or regplot. For example, both allow pandas dataframes as input, but only regplot accepts numpy arrays as input.
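
A minimal sketch tying these ideas together with seaborn's built-in 'iris' dataset:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

# Correlation matrix rendered as a heatmap: strong positive and negative pairs stand out immediately
corr = iris.drop(columns='species').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Joint plot with marginal distributions, stratified by species via hue
sns.jointplot(data=iris, x='petal_length', y='petal_width', hue='species')

# Regression plots: lmplot requires a dataframe, while regplot also accepts numpy arrays
sns.lmplot(data=iris, x='petal_length', y='petal_width')
sns.regplot(x=iris['petal_length'].to_numpy(), y=iris['petal_width'].to_numpy())
plt.show()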

To review, see Advanced Data Visualization Techniques.

 

Unit 6 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • barplot
  • category plot
  • ci
  • countplot
  • displot
  • distribution plot
  • estimator
  • FacetGrid
  • heatmap
  • hue
  • jointplot
  • lmplot
  • margin_titles
  • matrix plot
  • regplot
  • relational plot

Unit 7: Data Mining I – Supervised Learning

7a. Explain supervised learning techniques

  • What is supervised learning?
  • How are distance functions used in learning systems?
  • What is overfitting versus underfitting?
  • What is the main goal of supervised learning?

Supervised learning attempts, using training data, to create a mapping between a set of input data and a set of desired output targets arranged in the form of training pairs. A spectrum of techniques exists, ranging from statistical techniques such as regression to classical techniques such as the Bayes' decision and k-nearest neighbors to deep learning neural networks. While the field is vast, some general concepts are common to learning systems. For instance, it is often important to measure the distance between points using a distance function such as the Euclidean distance or the Manhattan distance. Such functions can be useful in the classification problem, where a data point can be classified by determining the minimum distance to a given class.
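
As a small illustration, the two distance functions mentioned above can be computed directly (the points are arbitrary):

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # sqrt((1-4)^2 + (2-6)^2) = 5.0
manhattan = np.sum(np.abs(p - q))           # |1-4| + |2-6| = 7.0
print(euclidean, manhattan)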

In general, the main goal of supervised learning is to fit a specific model using the training data. Different supervised learning algorithms attempt to achieve this goal in different ways. The k-nearest-neighbors algorithm achieves this goal by applying a distance function. Logistic regression achieves this goal by applying a linear regression and then a logistic function. Decision trees achieve this goal by creating a tree of rules based on the distribution of the training data. During the training process, one must avoid underfitting and overfitting a model. While this topic is expounded upon in a later module, it is important at the outset of any machine learning lesson to have a qualitative grasp of these concepts. Underfitting occurs when a mathematical model cannot adequately capture the underlying structure of the data. Overfitting occurs when the learning model fits so perfectly to the training data that it cannot generalize to points outside the training set. Finally, it is important to be aware of the bias-variance tradeoff. Bias is the difference between the average prediction of a model and the correct value the model is trying to predict. Variance is the variability of a model prediction for a given data point when applied over a sample of training sets. A reliable supervised training algorithm should find an optimal combination between these quantities.

To review, see Data Mining Overview and Supervised Learning.

 

7b. Apply methods in the scikit-learn module to supervised learning 

  • What are the syntax details for implementing k-nearest neighbors?
  • What are the syntax details for implementing decision trees?
  • What are the syntax details for implementing logistic regression?

This course has chosen to focus on Python implementations of decision trees, logistic regression, and k-nearest neighbors to help you begin your journey in machine learning using the scikit-learn module. Common to many Python machine learning modules are the fit, predict and score methods. The fit method will fit a model based upon which technique has been instantiated. When applying supervised learning, the score method can quantitatively compare a supervised test set against the model's prediction using, for example, the mean squared error. The predict method will yield the model output given a specific set of inputs. The train_test_split method is extremely useful for creating training and test sets from a larger dataset.

When it comes to instantiations such as linear_model.LogisticRegression, neighbors.KNeighborsClassifier, and tree.DecisionTreeClassifier, you must be very clear about how to fit these models. While the fit syntax is consistent across such models, their input parameters will obviously vary. For example, the input parameter criterion for choosing the decision strategy is specific to the decision tree, and the input parameter p for choosing the metric in k-nearest neighbors is specific to the k-nearest-neighbors algorithm. Take some time to review the respective input parameters and output attributes.
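
A minimal sketch fitting all three classifiers on a built-in dataset (the parameter values are arbitrary choices):

from sklearn import datasets, linear_model, neighbors, tree
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    'logistic regression': linear_model.LogisticRegression(max_iter=500),
    'k-nearest neighbors': neighbors.KNeighborsClassifier(n_neighbors=5, p=2),  # p=2 is the Euclidean metric
    'decision tree': tree.DecisionTreeClassifier(criterion='gini'),
}

for name, model in models.items():
    model.fit(X_train, y_train)                  # fit the model to the training pairs
    print(name, model.score(X_test, y_test))     # mean accuracy on the test set
    print(model.predict(X_test[:3]))             # predicted classes for three test points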

To review, see k-Nearest Neighbors, Decision Trees, and Logistic Regression.

 

7c. Implement Python scripts that extract features and reduce feature dimension 

  • What is feature extraction?
  • Why is dimensionality reduction useful?
  • What preprocessing steps can be helpful to the feature extraction process?

After data is collected for a given data science application, the dataset is usually structured as a set of observations, and, after a set of processing steps, each observation is organized as a set of variables or 'features'. Once a set of features is arrived at, the dataset is often subjected to preprocessing steps such as feature scaling or feature normalization to attribute equivalent weight to each feature. For example, instantiating preprocessing.MinMaxScaler enables the scaling of all feature magnitudes to within a minimum and maximum range (where the default range is between zero and one), and preprocessing.StandardScaler can be used to normalize features. The process of taking raw data and converting it into a set of scaled or normalized features is known as feature extraction.

When a large number of features (such as more than 10) is derived from a dataset, it is important to consider techniques that can reduce the dimensionality of the data. This is partly due to what some term the curse of dimensionality, where, as the dimension of the data (that is, the number of features) increases, all data points appear as if they are equidistant. In other words, the concept of a distance function for the classification problem is rendered ineffective in higher dimensions. Such a conclusion necessitates using methods for dimensionality reduction such as principal component analysis (PCA), which can be implemented in Python by invoking decomposition.PCA. PCA is an eigenvector decomposition (of the data's covariance structure) that determines which components account for the most variation in the dataset. Therefore, you must understand output attributes such as explained_variance_ratio_ and singular_values_, which describe each eigenvector component's 'strength', or contribution, relative to the original feature set.
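
A minimal sketch of feature scaling followed by PCA:

from sklearn import datasets, decomposition, preprocessing

X, _ = datasets.load_iris(return_X_y=True)

X_minmax = preprocessing.MinMaxScaler().fit_transform(X)    # scale each feature to [0, 1]
X_std = preprocessing.StandardScaler().fit_transform(X)     # zero mean, unit variance

pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)                        # project onto two principal components

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # fraction of the variance captured by each component
print(pca.singular_values_)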

To review, see Principal Component Analysis.

 

7d. Train and evaluate models using data mining techniques 

  • What methods and techniques are applied when training and evaluating learning models?
  • What method can compute the mean accuracy of a model response to a test set?
  • How is the argmax calculation applied in pattern classification problems?

The training and evaluation of models using the scikit-learn module are generally accomplished using the train_test_split, fit, predict and score methods. Visualization techniques such as scatter, distribution, matrix, and regression plots (using, for example, matplotlib and seaborn) can also be of great use for evaluating, for example, classification performance. The score method computes the mean accuracy of a model response to a test set. If only the model output is desired, then the predict method is more appropriate.

When it comes to training and evaluating models, it is helpful to have some intuition regarding the output of a given technique. For example, for a method such as k-nearest neighbors, you should be able to predict a classifier output given a two-class problem on the real line with a small set of observations. It is also helpful to be familiar with the numpy argmax method for determining an index where a maximum value can be found within a numerical list or a vector. In classification problems, one can be equally concerned with the argmax location and the actual maximum value, which could be used as a confidence measure. Finally, it is often the case that one may want to optimally tune a parameter or choice of parameters for a given training set. An optimal search of this kind can be helped using the GridSearchCV method.
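
A minimal sketch of an argmax-style decision followed by a grid search over a k-nearest-neighbors parameter (the candidate values are arbitrary):

import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV, train_test_split

# argmax: choose the class with the highest score and keep the score as a confidence measure
scores = np.array([0.1, 0.7, 0.2])
print(np.argmax(scores), scores.max())      # class index 1 with confidence 0.7

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(neighbors.KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))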

To review, see Training and Testing.

 

Unit 7 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • argmax
  • Bayes' decision
  • bias-variance tradeoff
  • classification problem
  • decision tree
  • decomposition.PCA
  • dimensionality reduction
  • distance function
  • Euclidean distance
  • explained_variance_ratio_
  • feature extraction
  • feature normalization
  • feature scaling
  • fit
  • GridSearchCV
  • k-nearest neighbors
  • linear_model.LogisticRegression
  • logistic regression
  • Manhattan distance
  • minimum distance
  • neighbors.KNeighborsClassifier
  • overfitting
  • predict
  • preprocessing.MinMaxScaler
  • preprocessing.StandardScaler
  • principal component analysis (PCA)
  • score
  • singular_values_
  • supervised learning
  • train_test_split
  • training pairs
  • tree.DecisionTreeClassifier
  • underfitting

Unit 8: Data Mining II – Clustering Techniques

8a. Explain unsupervised learning concepts 

  • What distinguishes unsupervised learning from supervised learning?
  • What kind of applications are well suited for unsupervised learning?
  • What is the main goal of a clustering algorithm?

There exist classification problems where a given training set consists only of input data without any classification labels (or "output targets") associated with the input data. Under these circumstances, algorithms are required to partition the input data into subsets that will hopefully reveal important aspects of the data. Whereas supervised learning algorithms require preassigned targets, unsupervised learning algorithms do not. For example, recognizing images of license plate numbers or distinguishing between images of apples versus oranges would be considered supervised learning problems because the classes are known. On the other hand, the classification of elements within images of natural scenery would be an example of an unsupervised learning problem. This is because new elements not previously categorized could arise within an image.

Clustering algorithms offer a highly successful approach to solving unsupervised classification problems. They enable the data to tell you what the categories are in a principled way (and not the reverse). Hence, a major theme throughout all clustering techniques is their attempt to group similar data points into the same subset. Under ideal circumstances, the derived clusters will have small intra-set (within-cluster) distances and large inter-set (between-cluster) distances. Some clustering techniques require the computation of a centroid (or "mean vector"), a vector whose components are the empirical mean of each component (or "variable") computed across all observations. Finally, since clustering algorithms are objective and unbiased, one potential drawback is that the clusters arrived at may not correspond to definable targets whose meaning can be interpreted.

To review, see Unsupervised Learning.

 

8b. Apply methods in the scikit-learn module to perform unsupervised learning

  • What are the main clustering algorithms covered in this course?
  • What is the syntax for invoking these clustering algorithms in scikit-learn?
  • What input parameters and output attributes are applicable to these algorithms?

While many clustering approaches exist, two of the most well-known algorithms are K-means clustering and agglomerative clustering. Objects for implementing them can be instantiated using scikit-learn by invoking cluster.KMeans and cluster.AgglomerativeClustering. The n_clusters input parameter specifies the number of clusters to be computed. After applying the fit method, cluster memberships of the training data can be determined using the labels_ output attribute.

The K-means algorithm requires the number of clusters as an input. If the number of clusters is unknown, then it is common to test a series of cluster counts to numerically determine the optimal cluster composition. The inertia_ attribute is useful for this purpose as it measures the sum of squared distances of observations to their closest cluster center. The agglomerative clustering algorithm also allows for the number of clusters as an input parameter. The between-cluster (set) distance used for merging can be chosen using the linkage parameter, which allows for typical choices such as 'ward', 'complete', 'average', or 'single', where the 'ward' linkage is the default value. The affinity parameter can be used to choose the distance metric, which can be set to typical values such as 'euclidean', 'manhattan', 'l1', or 'l2'. It also allows for precomputed metrics. However, if Ward linkage has been chosen, the affinity parameter must be set to 'euclidean'.
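
A minimal sketch fitting both algorithms on a built-in dataset (note that recent scikit-learn versions rename affinity to metric, so it is omitted here and the Euclidean default is used):

from sklearn import cluster, datasets

X, _ = datasets.load_iris(return_X_y=True)

km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print(km.labels_[:10])        # cluster membership of the first ten observations
print(km.inertia_)            # sum of squared distances to the closest cluster center
print(km.cluster_centers_)    # the three centroids

agg = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
agg.fit(X)
print(agg.labels_[:10])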

To review, see K-means Clustering and Hierarchical Clustering.

 

8c. Explain similarities and differences between hierarchical clustering and K-means clustering 

  • How is K-means clustering similar to agglomerative clustering?
  • How does K-means clustering differ from agglomerative clustering?
  • How do these differences translate into application and syntax?

Both K-means and agglomerative clustering algorithms are distance-based techniques. K-means clustering is an iterative algorithm that uses point distances to define cluster membership, where clusters are defined by their centroids. Agglomerative clustering uses set distances (linkages) as the criteria for merging clusters. Furthermore, the agglomerative clustering algorithm constructs a tree (or dendrogram) as more clusters are joined using a distance matrix.

The most generic forms of the K-means algorithm require specifying the number of clusters. You should be familiar with the inertia_ output attribute for helping to determine a sensible cluster number. The agglomerative clustering algorithm allows for the number of clusters as an input; however, it can also be used to determine the number of clusters. This can be done, for example, by computing the full tree (by setting the compute_full_tree parameter to True). The distances_ output attribute can then be used to determine a distance_threshold so that a cutoff can be applied to determine the number of clusters. The associated dendrogram can help visualize the cutoff point.
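
A minimal sketch of choosing a cluster count with inertia_ and of cutting an agglomerative tree with a distance threshold (the threshold value is an arbitrary illustration):

from sklearn import cluster, datasets

X, _ = datasets.load_iris(return_X_y=True)

# "Elbow" search: watch where the decrease in inertia_ starts to level off
for k in range(1, 7):
    km = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# Let a distance cutoff, rather than a preset count, determine the number of clusters
agg = cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=10,
                                      compute_full_tree=True, linkage='ward')
agg.fit(X)
print(agg.n_clusters_)        # the number of clusters implied by the threshold
print(agg.distances_[-5:])    # the largest merge distances in the tree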

To review, see Hierarchical Clustering.

 

8d. Validate models using clustering techniques 

  • How can clustering methods be applied to make meaningful predictions?
  • Which clustering technique allows for the use of the predict method?
  • How can supervised learning play a role after a set of clusters have been determined?

After applying a clustering technique, the silhouette_score function from sklearn.metrics can be used to compare the mean intra-cluster distance with the mean nearest-cluster distance. This provides a principled way of evaluating the clusters. In general, once the set of clusters has been decided upon and labels have been assigned (that is, once the model has been fit), the unsupervised problem has been transformed into a supervised classification problem. Hence, you should be comfortable applying any supervised learning method introduced in this course as a classification scheme for the derived clusters.

The K-means class allows for the use of the predict method where cluster membership for a given test vector is decided by choosing the closest centroid. This effectively reduces the class membership problem to a 1-nearest neighbor problem. Agglomerative clustering does not allow for the use of the predict method. However, you should feel comfortable taking the clusters and, for example, computing a centroid vector. Additionally, you should be prepared to apply the k-nearest neighbor algorithm using a set of clusters generated by either K-means or agglomerative clustering.
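
The following sketch, under the same synthetic-data assumption, illustrates scoring clusters with silhouette_score, predicting with K-means, and reusing agglomerative labels in a k-nearest neighbor classifier.

  # Sketch: evaluating clusters and reusing them for supervised prediction.
  # The make_blobs data and parameter choices are illustrative assumptions.
  from sklearn.datasets import make_blobs
  from sklearn.cluster import KMeans, AgglomerativeClustering
  from sklearn.metrics import silhouette_score
  from sklearn.neighbors import KNeighborsClassifier

  X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
  X_train, X_test = X[:150], X[150:]

  # K-means supports predict: new points go to the nearest centroid.
  km = KMeans(n_clusters=3, random_state=1).fit(X_train)
  print(silhouette_score(X_train, km.labels_))   # cluster quality score
  print(km.predict(X_test[:5]))

  # Agglomerative clustering has no predict; treat its labels as targets
  # and train a k-nearest neighbor classifier on them instead.
  agg = AgglomerativeClustering(n_clusters=3).fit(X_train)
  knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, agg.labels_)
  print(knn.predict(X_test[:5]))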

To review, see Training and Testing.

 

Unit 8 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • affinity
  • centroid
  • cluster.AgglomerativeClustering
  • cluster.KMeans
  • clustering
  • compute_full_tree
  • dendrogram
  • distance matrix
  • distances_
  • distance_threshold
  • inertia_
  • labels_
  • linkage
  • n_clusters
  • silhouette_score
  • unsupervised classification

Unit 9: Data Mining III – Statistical Modeling

9a. Explain linear regression concepts 

  • What is linear regression?
  • What quantities are of importance when performing linear regression?
  • What are some characteristics of linear regression computations?

Linear regression is a supervised modeling technique for finding an optimal linear fit to a dataset. Simple linear regression relates one dependent variable to one independent variable by determining a slope and an intercept. Multiple linear regression involves more than one independent variable. An explicit intercept is optional because the constant term can be absorbed into the solution of the linear regression equations. The scikit-learn module produces the intercept by default, while the statsmodels module allows either explicit intercept computation or absorbing the constant into the regression equations. The optimal fit is obtained by solving a least squares problem in which the expected value of the squared error is minimized.

Important quantities derived from linear regressions are the residuals (the differences between the observed values and the estimates) and the correlation coefficient (which lies between -1 and 1). A value near zero implies little correlation and suggests that a linear model is unlikely to explain the data. A positive correlation coefficient implies the dependent variable increases as the independent variable increases. A negative correlation coefficient implies the dependent and independent variables are inversely related. For the purposes of this course, the coefficient of determination (also known as the R2 score) is the square of the correlation coefficient. You should be aware of the equations for these quantities. Additionally, you should have some intuition regarding the trend of the data as it relates to the sign of the slope for simple linear regression. Lastly, for simple linear regression, you should know why the regression line passes through the mean values of the dependent and independent variables.
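
As a worked illustration, here is a minimal numpy sketch of these quantities for a small dataset; the x and y values are invented purely for illustration.

  # Sketch of simple linear regression quantities computed by hand
  # (the small x, y arrays are made-up illustrative values).
  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

  # Least squares slope and intercept; the fitted line passes through
  # the point (mean of x, mean of y).
  slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
  intercept = y.mean() - slope * x.mean()

  y_hat = slope * x + intercept
  residuals = y - y_hat                  # observed minus estimated values

  r = np.corrcoef(x, y)[0, 1]            # correlation coefficient, between -1 and 1
  r_squared = r**2                       # coefficient of determination
  print(slope, intercept, residuals, r_squared)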

To review, see Linear Regression and Residuals.

 

9b. Apply the scikit-learn module to build linear regression models

  • What is the scikit-learn instantiation for linear regression?
  • What are important computational details for fitting a linear regression model?
  • What are key output attributes for implementing a scikit-learn linear regression?

The scikit-learn module offers the capacity to create linear regression models using the instantiation linear_model.LinearRegression. When fitting a model, it is important to be aware of the subtleties of how the data must be arranged. Since the data matrix for the independent variables is assumed to have dimensions (number of observations)-by-(number of variables), it must have two dimensions. In other words, if there is only one independent variable, the data cannot be passed as a one-dimensional numpy vector (that is, a single-subscript array). If the data is in the form of a vector, it must first be reshaped using the reshape method before fitting the model.

With regard to the output of the linear regression model, it is important to know the syntax for referring to key quantities. The regression coefficients and the intercept can be referenced using the coef_ and intercept_ attributes. For multiple linear regression, the coef_ attribute will be in the form of a vector. For small datasets using simple linear regression, you should be able to demonstrate your intuition by estimating the slope and the intercept to verify the output of a linear regression. Lastly, the coefficient of determination can be computed using the r2_score function contained within the sklearn.metrics module.
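
A minimal sketch of the corresponding scikit-learn workflow, using the same kind of made-up data, might look like this.

  # Sketch of a simple linear regression in scikit-learn
  # (the small x, y arrays are made-up illustrative values).
  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.metrics import r2_score

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

  # One independent variable: reshape the vector into an (observations, 1) matrix.
  X = x.reshape(-1, 1)

  model = LinearRegression().fit(X, y)
  print(model.coef_, model.intercept_)   # slope(s) and intercept

  y_hat = model.predict(X)
  print(r2_score(y, y_hat))              # coefficient of determination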

To review, see Linear Regression.

 

9c. Apply the scikit-learn module to validate linear regression models 

  • What methods are important for model validation?
  • What is cross-validation?
  • How is cross-validation implemented in practice?

As is customary for many machine learning modules, the predict and score methods can be applied to test data after fitting the model. It is important to note that the linear model can both interpolate within the bounds of the training set and extrapolate outside them. It is, therefore, up to the data scientist to interpret the predictions. For example, if a simple linear regression models the cost of some item, it would be unrealistic to consider predicted values less than zero (even though the regression would happily process such inputs).

Now that various topics in supervised and unsupervised learning have been covered, it is sensible to consider constructing more sophisticated model validation approaches such as cross-validation. In this approach, a test (or "validation") set is extracted from the overall dataset and kept separate from the training set. The test and training data are selected randomly; therefore, several random splits can be performed to arrive at a statistical estimate of the model performance. In k-fold cross-validation, for example, the dataset is partitioned into k folds, and each fold serves once as the validation set while the remaining folds are used for training. You should feel comfortable using your Python programming skills to construct cross-validation tests.
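
A minimal sketch of repeated random splits, followed by scikit-learn's built-in k-fold helper, on an assumed synthetic linear dataset; the data, number of repetitions, and split sizes are all illustrative assumptions.

  # Sketch of simple repeated random-split validation for a linear regression
  # (the synthetic data and number of repetitions are illustrative assumptions).
  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import train_test_split, cross_val_score

  rng = np.random.default_rng(0)
  X = rng.uniform(0, 10, size=(100, 1))
  y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)

  # Several random train/test splits give a statistical estimate of performance.
  scores = []
  for i in range(5):
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=i)
      model = LinearRegression().fit(X_tr, y_tr)
      scores.append(model.score(X_te, y_te))
  print(np.mean(scores), np.std(scores))

  # scikit-learn's built-in k-fold cross-validation gives a similar summary.
  print(cross_val_score(LinearRegression(), X, y, cv=5))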

To review, see Linear Regression and Cross-Validation.

 

9d. Explain data overfitting 

  • What is overfitting?
  • What is underfitting?
  • What steps can be taken to avoid overfitting a model?

After a model has been trained, objective measures must be in place to ensure acceptable performance, and the data scientist must be able to spot when something is awry. Underfitting occurs when a mathematical model cannot adequately capture the underlying structure of the data; this can happen when the data is too complex for the model. Overfitting occurs when a model has more parameters than the data can reliably constrain. In this case, the model fits the training data too closely and loses its ability to generalize. This means a quantity such as the mean squared training error can be relatively low while the model still fails to handle data outside the training set accurately.

To avoid overfitting and underfitting, you can apply cross-validation techniques. For example, it is common to partition computational training steps into epochs where a validation set is tested at the end of each epoch to ensure a given model retains its ability to generalize during the training session. In this way, the model is constantly being tested to avoid underfitting or overfitting, and both the training error and the validation error are kept small. For simple examples such as the two-class classification problem, you should be able to visualize a well-fitted versus an overfitted decision boundary between the class datasets.
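
As a simple illustration of underfitting versus overfitting, the following sketch fits polynomials of increasing degree to an assumed noisy quadratic dataset and compares training and test errors; the data and degree choices are illustrative assumptions.

  # Sketch contrasting underfit, reasonable, and overfit polynomial models
  # (the noisy quadratic dataset and chosen degrees are illustrative assumptions).
  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import mean_squared_error

  rng = np.random.default_rng(0)
  x = rng.uniform(-3, 3, size=60)
  y = 0.5 * x**2 + rng.normal(0, 0.5, size=60)
  X_tr, X_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y, random_state=0)

  for degree in (1, 2, 15):
      Xp_tr = PolynomialFeatures(degree).fit_transform(X_tr)
      Xp_te = PolynomialFeatures(degree).fit_transform(X_te)
      model = LinearRegression().fit(Xp_tr, y_tr)
      train_err = mean_squared_error(y_tr, model.predict(Xp_tr))
      test_err = mean_squared_error(y_te, model.predict(Xp_te))
      # A high-degree fit tends to show low training error but larger test error.
      print(degree, train_err, test_err)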

To review, see Overfitting.

 

Unit 9 Vocabulary 

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • coefficient of determination
  • coef_
  • correlation coefficient
  • cross-validation
  • intercept_
  • linear_model.LinearRegression
  • overfitting
  • r2_score
  • residual
  • underfitting

Unit 10: Time Series Analysis

10a. Apply methods in the statsmodels module 

  • What is the statsmodels module?
  • How does it differ from scikit-learn?
  • What types of models can be implemented?

The statsmodels module is useful for creating statistical models, conducting statistical tests, and performing statistical data exploration. Its intent is to provide, within Python, much of the statistical functionality of the R programming language for data science applications. It goes beyond the statistical capabilities of the scikit-learn module, offering more sophisticated models and more comprehensive descriptions of model output. While there is some overlap between scikit-learn and statsmodels, statsmodels arranges its output with the mindset of a data scientist. In this unit, the focus is on the module's time series analysis models.

As a step toward introducing the module syntax and contrasting it with scikit-learn, it is important to consider the implementation of linear regression models. Objects for implementing a linear regression can be instantiated by invoking OLS (ordinary least squares). You should be aware that the endogenous variable (the dependent variable) is placed first when the OLS object is constructed (observe that this convention is the opposite of scikit-learn's). It is also important to realize that a linear equation such as y = Ax + b, where A is a matrix and b is a constant vector, can mathematically be rephrased as y = Mz: the constant is appended to the vector x to form z, and the matrix A is appended with a column of ones to form M. As opposed to scikit-learn, which produces the coef_ and intercept_ attributes, the default mode for OLS is y = Mz. If it is desired to generate an explicit intercept term, the add_constant method must first be invoked to append the column of ones to the data.

Output parameters are compacted into the params attribute, from which the intercept and regression coefficients can be extracted. Because of how statsmodels constructs its data description, the summary command is useful for visualizing all parameters and scores for a given model. Specific values contained within the summary require referencing a given attribute of the model; for example, the coefficient of determination can be referenced using rsquared. Note that this is a more compact approach, as scikit-learn requires invoking a separate method to generate this computation. Finally, an input parameter such as missing can be useful for OLS because it allows you to drop missing values if desired.
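
A minimal sketch of this workflow with made-up data; the x and y values are invented for illustration, and the conventional alias sm is used for statsmodels.api.

  # Sketch of an OLS regression in statsmodels (made-up illustrative data).
  import numpy as np
  import statsmodels.api as sm

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

  # Append a column of ones so an explicit intercept term is estimated.
  X = sm.add_constant(x)

  # Note the order: the endogenous (dependent) variable comes first.
  model = sm.OLS(y, X, missing='drop')
  results = model.fit()

  print(results.params)      # intercept and regression coefficient together
  print(results.rsquared)    # coefficient of determination
  print(results.summary())   # full tabular description of the fit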

To review, see The statsmodels Module.

 

10b. Explain the autoregressive and moving average models 

  • What is an autoregressive model?
  • What is a moving average model?
  • What are the important parameters for defining ARMA and ARIMA models?

An autoregressive (AR) model is a stochastic process modeled by a recursion relation whose output depends upon the previous 'p' values in the time series. It is a random process because a Gaussian white noise term is added to the series at each time step. A moving average (MA) model is a stochastic process generated by a weighted average of the 'q' previous Gaussian white noise terms in a time series. ARMA models are linear, and the important parameters for an ARMA(p,q) model are the order parameters 'p' and 'q' together with the mean and standard deviation of the white noise process.

A stationary random process, for the purposes of this course, is one whose mean and standard deviation remain constant as time varies. If these parameters vary with time, the process is called nonstationary. ARMA models apply to stationary processes. An important aspect of these models is the long-term behavior of the time series they generate. For example, for an AR(1) process, you should know the conditions on the model parameters under which the model output remains stable, and you should be able to calculate the mean of an AR(1) process.

When numerically producing an ARIMA(p,d,q) model from empirical nonstationary time series data, 'd' successive finite differences are applied before fitting the ARMA model. This is done to convert a nonstationary time series into a stationary one. After this step, ARMA techniques can be used to complete the model.
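
As an illustration of the recursion underlying an AR model, here is a minimal sketch that simulates an AR(1) process; the constant, coefficient, and noise level are assumed values chosen so that the process remains stable.

  # Sketch: simulating an AR(1) process x_t = c + phi * x_{t-1} + eps_t
  # (the constant, coefficient, and noise standard deviation are assumed values).
  import numpy as np

  rng = np.random.default_rng(0)
  c, phi, sigma = 1.0, 0.6, 0.5          # |phi| < 1 keeps the process stable
  n = 500

  x = np.zeros(n)
  for t in range(1, n):
      x[t] = c + phi * x[t - 1] + rng.normal(0, sigma)

  # For a stable AR(1) process the long-run mean is c / (1 - phi).
  print(x[100:].mean(), c / (1 - phi))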

To review, see Autoregressive (AR) Models and Moving Average (MA) Models.

 

10c. Implement and analyze AR, MA, and ARIMA models 

  • How do you test a time series for stationarity?
  • How can ARMA model parameters be estimated?
  • How is an ARIMA model implemented?

The statsmodels module is equipped with several analysis tools to help determine the order parameters p, d, and q in an ARIMA(p,d,q) model. For example, the augmented Dickey-Fuller (ADF) test adfuller can be used to test the null hypothesis that the data are not stationary and thereby help determine 'd'. The test can be applied to successive finite differences of a nonstationary time series to determine whether, and after how many differences, a stationary series is obtained. The autocorrelation function (ACF) and partial autocorrelation function (PACF) help determine the order parameters p and q in an ARMA(p,q) model and can be computed using the acf and pacf methods. The sgt plotting utilities from the statsmodels graphics subpackage can be used to visualize the PACF.

Once the values of p, d, and q are decided upon, the ARIMA model can be implemented by invoking ARIMA from statsmodels.tsa.arima_model. You should feel comfortable writing code that generates a simple ARMA model from its recursion equation and that fits an ARIMA model to input time series data given values of p, d, and q. Once a model has been fitted, you should be aware that the maparams attribute can be used to reference the fitted moving average coefficients. The summary method is also useful for visualizing the model results. You should additionally realize that p-values for the computed model coefficients can be referenced using pvalues once the model has been created.
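
A minimal sketch of this workflow on an assumed simulated series; note that newer statsmodels releases provide ARIMA in statsmodels.tsa.arima.model, while the statsmodels.tsa.arima_model path referenced in the course readings may not be available in current versions.

  # Sketch: testing stationarity and fitting an ARIMA model
  # (the simulated AR(1) series and the chosen order (1, 0, 0) are assumptions;
  #  older statsmodels releases expose ARIMA via statsmodels.tsa.arima_model).
  import numpy as np
  from statsmodels.tsa.stattools import adfuller, acf, pacf
  from statsmodels.tsa.arima.model import ARIMA

  rng = np.random.default_rng(0)
  x = np.zeros(500)
  for t in range(1, 500):
      x[t] = 0.6 * x[t - 1] + rng.normal(0, 1.0)

  # ADF test: a small p-value rejects the null hypothesis of nonstationarity.
  print(adfuller(x)[1])

  # ACF and PACF values help suggest the orders p and q.
  print(acf(x, nlags=5))
  print(pacf(x, nlags=5))

  results = ARIMA(x, order=(1, 0, 0)).fit()
  print(results.params)      # fitted model coefficients
  print(results.pvalues)     # p-values for the coefficients
  print(results.summary())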

To review, see Autoregressive Integrated Moving Average (ARIMA) Models.

 

Unit 10 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • add_constant
  • ADF test
  • ARIMA
  • ARMA model
  • autocorrelation function
  • Endogenous variable
  • maparams
  • missing
  • OLS
  • params
  • partial autocorrelation function
  • rsquared
  • sgt
  • stationary random process
  • summary