The last step in this introduction to data science requires us to deal with data derived from time series, such as stock prices as a function of time. All the tools from the earlier units will play a role in performing these analyses. As in the last unit, our goal is to build statistical models that allow for inference and prediction.
This unit introduces models for analyzing time-series data. The statsmodels module contains various analysis tools, including methods for handling stationary and nonstationary data. The focus is on constructing autoregressive, moving average, and autoregressive integrated moving average models. By the end of this unit, you will be able to implement Python programs capable of statistical inference and forecasting from time-series data.
Completing this unit should take you approximately 6 hours.
Many Python modules with statistical capabilities are not completely disjoint. You have probably noticed some overlap between scipy.stats, numpy, pandas, and scikit-learn (for example, scipy.stats can perform linear regression using the linregress method). This overlap simplifies the import process when making basic statistical calculations on arrays and dataframe data. On the other hand, there comes a point where major differences become obvious. This motivating example compares the functionality of the linregress method against the ols method from statsmodels. Follow this tutorial to see how the statsmodels module improves upon a module such as scipy.stats when building statistical models.
This example is similar to the previous one but constructs a simple data set so that the report results generated by statsmodels are easy to digest.
This tutorial is designed to help you jump from the scikit-learn module to statsmodels. Practice the code examples in order to thoroughly grasp the differences. The housing dataset USA_Housing.csv in this tutorial is available here or on the Kaggle website, as mentioned in the video. You can download this file to your local drive. If you are using Google Colab, you can use the instructions outlined in subunit 5.1 of this course for loading a local file.
A time series is a set of points that are ordered in time. Each time point is usually assigned an integer index to indicate its position within the series. For example, you can construct a time series by measuring and computing an average daily temperature. When the outcome of the next point in a time series is unknown, the time series is said to be random or "stochastic" in nature. A simple example would be creating a time series from a sequential set of coin flips with outcomes of either heads or tails. A more practical example is the time series of prices of a given stock.
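A minimal sketch of the coin-flip example, using pandas to attach a time index (the dates here are arbitrary stand-ins):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A stochastic time series: 10 coin flips (1 = heads, 0 = tails),
# indexed by consecutive days
flips = pd.Series(
    rng.integers(0, 2, size=10),
    index=pd.date_range("2023-01-01", periods=10, freq="D"),
    name="coin_flip",
)
print(flips)
```

Each point is ordered in time, and the outcome of the next point is unknown in advance, which is what makes the series stochastic.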
When the unconditional joint probability distribution of the series does not change with time (it is time-invariant), the stochastic process generating the time series is said to be stationary. Under these circumstances, parameters such as the mean and standard deviation do not change over time. Assuming the same coin for each flip, the coin flip is an example of a stationary process. On the other hand, stock price data is not a stationary process.
This unit aims to use your knowledge of statistics to model time series data for random processes. Even though the outcome of the next time point is unknown, given the time series statistics, it should be possible to make inferences if you can create a model. The concept of a stationary random process is central to statistical model building. Since nonstationary processes require a bit more sophistication than stationary processes, it is important to understand what type of time series is being modeled. Our first step in this direction is to introduce the autoregressive (AR) model. This linear model can be used to estimate current time series values based on known past time series values. Read through this article, which introduces the idea behind AR models and additionally explains the autocorrelation function (ACF).
This tutorial introduces time series analysis and concludes with coding the AR model using statsmodels. Follow along with the programming example for practice. Note that statsmodels.tsa.ar_model.AR has been deprecated in favor of statsmodels.tsa.ar_model.AutoReg due to processing improvements within statsmodels.
Since AR models only look back over a finite number of samples, they need time to adjust to unexpected shocks in a time series. You must model past instances of the input noise to handle unforeseen shocks. Moving average (MA) models can be used for this purpose. Read this article to learn the general structure of the MA model.
This tutorial provides several examples of MA models of various orders. In addition, the partial autocorrelation (PACF) function is introduced. The ACF and PACF are important tools for estimating the order of a model based on empirical data.
This video summarizes the key points regarding AR and MA models. In general, stationary time series modeling requires a balance between these two approaches. In the next section, you will learn how to combine them and apply them in time series analysis.
The autoregressive integrated moving average (ARIMA) model is an approach for nonstationary time series. It applies a combination of AR and MA modeling to balance out time series variances that can occur within a stochastic process. Additionally, it is often possible to convert a nonstationary time series to a stationary series by taking successive differences. The "I" in ARIMA stands for the number of differences needed to eliminate nonstationary behavior. Read this article to get an overview of the mathematical form of the ARIMA(p,d,q) approach to model building. Take note of how you can use various choices of the p, d, and q parameters to form AR, MA, ARMA, or ARIMA models.
Use this tutorial to implement an ARIMA model and make forecasts. General reference is made to a data set, but you must obtain your own CSV file for actual data. A great source for data scientists is Kaggle. With your current expertise, you should be able to search for and download a .csv file with stock price data that is not too large (<50MB). Additionally, as illustrated in the tutorial, you can apply pandas to extract a column of data.
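The extraction step might look like the following sketch. The file name and the Date/Close column headers are placeholders for whatever your downloaded CSV actually contains; a tiny stand-in file is written first so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for a downloaded Kaggle file: write a tiny CSV with the
# typical shape of a stock-price file (Date and Close are assumptions)
pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=5, freq="D"),
    "Close": [101.2, 102.5, 101.9, 103.0, 104.1],
}).to_csv("stock_prices.csv", index=False)

# Read the CSV and extract one column as a time-indexed series
df = pd.read_csv("stock_prices.csv", parse_dates=["Date"], index_col="Date")
close = df["Close"]
print(close.head())
```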
This tutorial delves a bit deeper into statistical models. Study it to better understand the ARIMA and seasonal ARIMA models. Pay close attention to the discussion of how to apply the ACF and PACF to estimate the order parameters for a given model. In practical circumstances, this is an important question, as such parameters are often initially unknown.
Here is a practical application of the ARIMA model. Although this tutorial makes brief references to the R language, you should use it to tie together the concepts (AR, MA, ACF, and PACF) presented in this unit.
This tutorial demonstrates how to implement the models and forecasting discussed in this unit. Since we are using Google Colab, you can jump to Step 2 to begin this programming example. Upon completing this tutorial, you should be able to construct models, make forecasts, and validate those forecasts given a time series data set.
Take this assessment to see how well you understood this unit.