Unit 4: Applied Statistics in Python
As data science can often involve making statistical inferences from data, many of the upcoming units will apply calculations rooted in probability and statistics. This unit is foundational in that it discusses various ways of generating random data, computing basic statistical measures, and performing statistical analyses in Python. When you finish this unit, you will be able to implement and apply Python methods from the scipy.stats module.
You have already seen that the random module can generate scalar random numbers and that numpy can generate arrays of random numbers. We will also find that many numpy methods extend quite naturally to the pandas module we will introduce in the next unit. Additionally, the scipy.stats module allows for statistical modeling and parameter calculations. These Python implementations will serve as a foundation for the more sophisticated methods we will use later in the course.
Completing this unit should take you approximately 13 hours.
Upon successful completion of this unit, you will be able to:
- apply methods for random numbers within the numpy module;
- apply statistical methods within the scipy.stats module; and
- apply the scipy.stats module for solving data science problems.
4.1: Basic Statistical Measures and Distributions
At the core of data science are statistical methods that enable verifying hypotheses, drawing conclusions, making inferences, and forecasting. To understand how to apply the numpy, matplotlib, and scipy modules, the fundamentals of statistics must first be introduced (or reviewed). Watch from 4:36:30 to get an overview of how to apply statistics in this unit and throughout the course.
Before delving into deeper topics, it is important to be clear about fundamental terms such as probability, statistics, data, and sampling. Additionally, you should master quantities derivable from empirical data, such as frequency, relative frequency, and cumulative frequency.
Once data has been collected and categorized, visualizations and fundamental calculations help describe the data. The visualization approaches (such as bar charts, histograms, and box plots) and calculations (such as mean, median, and standard deviation) introduced here will be revisited and implemented using Python.
A random experiment is one where the set of possible outcomes is known, but the outcome is unknown for each experiment. Under these circumstances, given enough data, we can assign a probability to each possible outcome. A host of concepts can be developed from these basic notions, such as independence, mutual exclusivity, and conditional probability. Furthermore, rules governing calculations, such as adding probabilities and multiplying probabilities, naturally follow from these basic concepts.
Terms and definitions from descriptive statistics readily carry over to situations where values are countable. The set of outcomes for a coin flip, the roll of a die, or a set of cards are all examples of discrete random variables. We can then put concepts such as mean and standard deviation on a firmer mathematical footing by defining the expected value and the variance of a discrete random variable.
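As a minimal sketch of these definitions, the expected value and variance of a fair six-sided die can be computed directly from its outcomes and probabilities:

```python
import numpy as np

# A fair six-sided die: outcomes and their probabilities
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1 / 6)

# Expected value: E[X] = sum of x * p(x)
expected = np.sum(outcomes * probs)

# Variance: Var[X] = E[X^2] - (E[X])^2
variance = np.sum(outcomes**2 * probs) - expected**2

print(expected)   # 3.5
print(variance)   # 35/12, about 2.9167
```

The same two formulas apply to any discrete random variable once its outcomes and probabilities are listed.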
Once you have grasped the notion of a discrete random variable, it should be clear that not all random variables are discrete. For example, consider measuring the atmospheric temperature at some prescribed location. The measured temperature would be random and could take on a continuum of values (theoretically speaking). Under these circumstances, we say that the random variable is continuous. All the machinery developed for discrete random variables (such as expected value, variance, and mean) must be extended to continuous random variables to handle this situation. The uniform distribution (which you have programmed using the random module) is an example of a continuous probability distribution.
The normal distribution is an example of a continuous distribution. Because it arises so often when considering empirical measurements, it is fundamental to probability and statistics, and we must devote special attention to it. The normal distribution is used as the basis for many statistical tests. Hence, it is essential to understand its mathematical form, graph, z-score, and the area under the curve.
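A short sketch of the z-score and the area under the normal curve, using scipy.stats (the numbers 130, 100, and 15 are made up for illustration):

```python
from scipy.stats import norm

# z-score: how many standard deviations an observation lies from the mean
x, mu, sigma = 130, 100, 15
z = (x - mu) / sigma          # 2.0

# Area under the standard normal curve to the left of z (the CDF)
area = norm.cdf(z)

# Roughly 95% of values fall within 1.96 standard deviations of the mean
within = norm.cdf(1.96) - norm.cdf(-1.96)
print(z, area, within)
```

The cdf method returns the area under the curve, which is exactly the quantity that many statistical tests are built on.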
Calculating confidence intervals is fundamental to statistical analyses and statistical inference. This is because statistical calculations such as a mean or a probability are a function of the sample size with respect to the (possibly unknown) size of a larger population. Therefore, you must also include techniques for estimating your confidence in a given value along with the value. As you go deeper into the upcoming units, you will need to understand confidence intervals developed for the normal distribution and the student's t-distribution.
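As a hedged sketch of a confidence interval using the Student's t-distribution (the sample values below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of repeated measurements
sample = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3])

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean

# 95% confidence interval based on the t-distribution with n - 1
# degrees of freedom
lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(lo, hi)
```

The t-distribution is the appropriate choice here because the population standard deviation is unknown and the sample is small; with large samples, the interval approaches the one based on the normal distribution.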
In addition to calculating confidence intervals, hypothesis testing is another way to make statistical inferences. This process involves considering two opposing hypotheses regarding a given data set (referred to as the null hypothesis and the alternative hypothesis). Hypothesis testing determines whether the data provide enough evidence to reject the null hypothesis in favor of the alternative.
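A minimal sketch of a one-sample t-test, using simulated data so that the true population mean is known (the means, sizes, and significance level are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated data drawn from a population whose true mean is 5.0
data = rng.normal(loc=5.0, scale=1.0, size=100)

# Null hypothesis: the population mean equals 5.0
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)

# At a significance level of 0.05, a p-value above 0.05 means we
# fail to reject the null hypothesis
print(t_stat, p_value)
```

The p-value is the probability, assuming the null hypothesis is true, of observing a sample at least as extreme as the one collected.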
At various points throughout the course, it will be necessary to build a statistical model that can be used to estimate or forecast a value based on a set of given values. When one has data where a dependent variable is assumed to depend upon a set of independent variables, linear regression is often applied as the first step in data analysis. This is because parameters for the linear model are very tractable and easily interpreted. You will be implementing this model in various ways using Python.
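As a first look at this model, scipy.stats provides linregress for fitting a line to paired data; the noiseless data below lie exactly on the (made-up) line y = 2x + 1:

```python
import numpy as np
from scipy import stats

# Illustrative data on the line y = 2x + 1
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = 2 * x + 1

result = stats.linregress(x, y)
print(result.slope, result.intercept, result.rvalue)  # 2.0 1.0 1.0
```

With real (noisy) data, the slope and intercept become estimates, and rvalue measures how well the line explains the data.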
4.2: Random Numbers in numpy
Data science requires the ability to process data often presented in the form of arrays. Furthermore, to test models designed for data mining or forecasting, it is often necessary to generate random arrays with desired statistical properties. As a first step toward understanding such implementations, you must learn how to use numpy to create arrays of random numbers and compute basic quantities such as the mean, the median, and the standard deviation.
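A minimal sketch of these first steps, using numpy's recommended Generator interface (the seed and array size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# An array of 10,000 uniform random numbers in [0, 1)
data = rng.random(10_000)

print(data.mean())     # close to 0.5
print(np.median(data)) # also close to 0.5
print(data.std())      # close to sqrt(1/12), about 0.2887
```

Seeding the generator makes the "random" output reproducible, which is invaluable when testing statistical code.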
Going beyond the basics of random numbers in the numpy module, it is important to see examples of how to compute using various distributions introduced at the beginning of this unit. The code introduced in these materials should be viewed as the array extension of scalar capabilities available within the random module.
Since the normal distribution is fundamental and arises so often in the field of statistical modeling, it is sensible to devote some attention to this subject in the context of numpy computations. This overview provides a simple example of how you can combine computation and visualization for statistical analysis.
By itself, numpy can perform various statistical calculations (in the next section, you will see how scipy builds upon this foundation). Try running and studying the code in this project to experience a data science application that analyzes empirical speed of light measurements.
As a lead-in to the next unit, you should know three instructions from the pandas module (read_csv, rename, and head). The read_csv method loads the data from a file into what is called a pandas "data frame" (analogous to a numpy array, but more general). The rename method renames a column within the data frame. The head method returns the first few rows of a data frame, which is useful for inspecting a data frame containing many rows. These methods will be discussed in more detail in the next unit. For now, try to focus on the data science application and the statistical and plotting methods used to analyze the data.
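A self-contained sketch of those three pandas methods; the CSV text, its column names, and the values are all invented for illustration (read_csv accepts a filename or any file-like object, so an in-memory string stands in for a file here):

```python
import io
import pandas as pd

# A small made-up CSV kept in memory for illustration
csv_text = "spd,run\n299850,1\n299740,2\n299900,3\n"
df = pd.read_csv(io.StringIO(csv_text))

# rename a cryptic column to something readable
df = df.rename(columns={"spd": "speed"})

# head returns the first few rows (five by default)
print(df.head())
```

In the project itself, the same pattern applies with a real file path in place of the StringIO object.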
4.3: The scipy.stats Module
The scipy module was constructed with numpy as its underlying foundation. The numpy module handles arrays efficiently, and scipy builds on it with a vast set of methods for scientific computation. In this unit, we are primarily concerned with applying the statistical power of the scipy.stats module, which, as you will see in this video, goes beyond the capabilities of numpy.
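As a small taste of what scipy.stats adds beyond numpy, its describe function bundles several summary statistics that would otherwise require separate calls (the sample values below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements
data = np.array([2.1, 2.5, 2.3, 2.8, 2.6, 2.2])

# One call returns count, min/max, mean, variance, skewness, and kurtosis
summary = stats.describe(data)
print(summary.mean, summary.variance, summary.skewness)
```

Note that describe reports the sample variance (with the n - 1 correction), whereas numpy's var defaults to the population variance.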
This video will enhance your understanding of how scipy.stats can be used. Use this tutorial to increase your vocabulary for statistical data processing. With this overview, we have enough Python syntax to get some applications up and running. It is best to begin implementations with some exercises in the next section.
Given any module that deals with statistics, one basic skill you must have is to be able to program and create plots of probability distributions typically encountered in the field of data science. This tutorial should remind you of various distributions introduced in this section, but now they are phrased using the scipy.stats module.
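A minimal sketch of the plotting workflow: each distribution in scipy.stats exposes a pdf method that can be evaluated on a grid of x values, and the resulting arrays can be passed directly to matplotlib's plot function (the grid and parameters below are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm, expon

# Evaluate probability density functions on a grid of x values
x = np.linspace(-4, 4, 201)
normal_pdf = norm.pdf(x, loc=0, scale=1)
expon_pdf = expon.pdf(x, scale=1)   # zero for x < 0

# Peak of the standard normal density is 1/sqrt(2*pi), about 0.3989
print(normal_pdf.max())
```

The same pattern (grid of x values, pdf or cdf evaluated on the grid, then plot) works for every continuous distribution in scipy.stats.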
4.4: Data Science Applications
It is time to exercise, reinforce, and apply various topics introduced throughout this unit. Study and practice section 1.6.6 to help review the basics of mixing numpy and scipy.stats for plotting data and running statistical tests. Be sure to import matplotlib so the plotting methods will run without exceptions when they are called.
Watch these videos to refine your Python coding skills with concepts such as modeling distributions from sampled data and confidence intervals. If any of the terms in this section are unfamiliar, go back to the first section of this unit to review.
Study these slides. In this project, you will apply techniques from this unit to analyze data sets using descriptive statistics and graphical tools. You will also write code to fit a probability distribution to the data (that is, estimate the distribution's parameters). Finally, you will learn to code various risk measures based on statistical tests. Upon completing this project, you should have a clearer picture of how you can use Python to perform statistical analyses within the field of data science.
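As a hedged sketch of the fitting step, scipy.stats distributions provide a fit method that estimates parameters from data; here simulated data with known parameters confirm that the estimates land close to the truth (the location, scale, and sample size are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Simulated measurements drawn from a normal distribution with
# known mean 50 and standard deviation 5
data = rng.normal(loc=50.0, scale=5.0, size=2_000)

# Estimate the distribution's parameters from the data alone
mu_hat, sigma_hat = norm.fit(data)
print(mu_hat, sigma_hat)  # close to 50 and 5
```

Fitting to data whose true distribution you control is a useful sanity check before applying the same code to real measurements.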
Unit 4 Assessment
- Receive a grade
Take this assessment to see how well you understood this unit.
- This assessment does not count towards your grade. It is just for practice!
- You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
- You can take this assessment as many times as you want, whenever you want.