## CS250 Study Guide

### Unit 10: Time Series Analysis

#### 10a. Apply methods in the statsmodels module

- What is the statsmodels module?
- How does it differ from scikit-learn?
- What types of models can be implemented?

The statsmodels module is useful for creating statistical models, conducting statistical tests, and performing statistical data exploration. Its intent is to bring to Python much of the statistical functionality of the R programming language for data science applications. It goes beyond the statistical capabilities of the scikit-learn module, offering more sophisticated models and more comprehensive descriptions of results. While there is a measure of overlap between scikit-learn and statsmodels, the statsmodels module arranges its data output with the mindset of a data scientist. This unit focuses on its time series analysis models.

As a step toward introducing the module syntax and contrasting it with scikit-learn, it is important to consider the implementation of linear regression models. Objects for implementing a linear regression can be instantiated by invoking **OLS** (ordinary least squares). When instantiating the model, be aware that the **endogenous variable** (the dependent variable) is placed first in the call; this convention is the opposite of scikit-learn's.

It is also important to realize that a linear equation such as y = Ax + b, where A is a matrix and b is a constant vector, can mathematically be rephrased as y = Mz: a one is appended to the vector x to form the vector z, and correspondingly the matrix A is appended with a column accounting for the constant term to form M. Unlike scikit-learn, which produces the coef_ and intercept_ attributes, the default mode for OLS is y = Mz. If it is desired to generate the intercept term, then the **add_constant** method must first be invoked. Output parameters are compacted into the **params** attribute, from which the intercept and regression coefficients can be extracted.

Because of how statsmodels constructs its data description, the **summary** command can be useful for visualizing all parameters and scores for a given model. Specific values contained within the summary are referenced as attributes of the fitted model; for example, the coefficient of determination can be referenced using **rsquared**. Note that this is a more compact approach, as scikit-learn requires invoking a separate method to generate this computation. Finally, an input parameter such as **missing** can sometimes be useful for OLS because it allows you to drop missing values if desired.

To review, see The statsmodels Module.

#### 10b. Explain the autoregressive and moving average models

- What is an autoregressive model?
- What is a moving average model?
- What are the important parameters for defining ARMA and ARIMA models?

An autoregressive (AR) model is a stochastic process modeled by a recursion relation whose output depends upon the previous 'p' values in the time series. It is a random process because a Gaussian white noise term is added to the series at each time step. A moving average (MA) model is a stochastic process generated by a weighted average of the 'q' previous Gaussian white noise terms in a time series. ARMA models are linear; the important parameters for an ARMA(p,q) model are the order parameters 'p' and 'q', together with the mean and standard deviation of the white noise process.
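In symbols, with ε_t denoting Gaussian white noise, the two recursions can be written as follows (standard notation, not taken verbatim from the course materials):

```latex
% AR(p): output depends on the previous p values plus white noise
x_t = c + \sum_{i=1}^{p} \varphi_i \, x_{t-i} + \varepsilon_t

% MA(q): weighted average of the q previous white-noise terms
x_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```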

A **stationary random process**, for the purposes of this course, is one whose mean and standard deviation remain constant as time varies. If these parameters vary with time, then the process is called nonstationary. ARMA models apply to stationary processes. An important aspect of these models is the long-term behavior of the time series. For example, for an AR(1) process, you should know the conditions on the model parameters under which the model output remains stable, and you should be able to calculate the mean of an AR(1) process.
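As a concrete check, an AR(1) recursion x_t = c + φ·x_{t−1} + ε_t is stable when |φ| < 1, and its mean converges to c/(1 − φ). A quick simulation with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, n = 2.0, 0.5, 100_000  # |phi| < 1, so the process is stable

# AR(1) recursion with unit-variance Gaussian white noise
x = np.zeros(n)
for t in range(1, n):
    x[t] = c + phi * x[t - 1] + rng.normal()

theoretical_mean = c / (1 - phi)  # = 4.0
print(x.mean(), theoretical_mean)  # sample mean should be close to 4.0
```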

When numerically producing ARIMA(p,d,q) models from empirical nonstationary time series data, 'd' finite differences are applied before fitting the ARMA models. This is done to convert a nonstationary time series to a stationary one. After this step, ARMA techniques can be used to complete the model.
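For instance, a random walk (the cumulative sum of white noise) is nonstationary, but a single finite difference recovers the stationary noise, so d = 1. A sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)  # stationary white noise
walk = np.cumsum(noise)        # random walk: nonstationary (d = 1)

diffed = np.diff(walk)         # first finite difference: stationary again
# diffed reproduces the original noise (from the second sample on)
```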

To review, see Autoregressive (AR) Models and Moving Average (MA) Models.

#### 10c. Implement and analyze AR, MA, and ARIMA models

- How do you test a time series for stationarity?
- How can ARMA model parameters be estimated?
- How is an ARIMA model implemented?

The statsmodels module is equipped with several analysis tools to help determine the order parameters p, d, and q in an ARIMA(p,d,q) model. For example, the augmented Dickey-Fuller (**ADF**) test, implemented by the adfuller function, tests the null hypothesis that the data are not stationary and helps determine 'd' in the ARIMA(p,d,q) model: the test can be fed 'd' successive finite differences of a nonstationary time series to determine if and at what point a stationary time series is found. The **autocorrelation function** (ACF) and **partial autocorrelation function** (PACF) can help determine the order parameters p and q in an ARMA(p,q) model and can be computed using the acf and pacf methods. The plotting tools in statsmodels.graphics.tsaplots (commonly imported under the alias **sgt**) can be used to visualize the PACF.

Once the values of p, d, and q are decided upon, the ARIMA model can be implemented by invoking **ARIMA** from statsmodels.tsa.arima_model (moved to statsmodels.tsa.arima.model in recent versions of statsmodels). You should feel comfortable writing code that generates a simple ARMA model from the recursion equation and that fits an ARIMA model to input time series data given the values of p, d, and q. Once a model has been fitted, be aware that the **maparams** attribute can be used to reference the fitted moving average coefficients. The summary method is also useful for visualizing the model results. Additionally, realize that p-values for the computed model coefficients can be referenced using pvalues once the model has been fitted.

To review, see Autoregressive Integrated Moving Average (ARIMA) Models.

#### Unit 10 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

- add_constant
- ADF test
- ARIMA
- ARMA model
- autocorrelation function
- endogenous variable
- maparams
- missing
- OLS
- params
- partial autocorrelation function
- rsquared
- sgt
- stationary random process
- summary