It is time to exercise, reinforce, and apply various topics introduced throughout this unit. Study and practice section 1.6.6 to help review the basics of mixing numpy and scipy.stats for plotting data and running statistical tests. Be sure to import matplotlib so the plotting methods will run without exceptions when they are called.
Statistics and random numbers: scipy.stats
The module scipy.stats
contains statistical tools and probabilistic
descriptions of random processes. Random number generators for various
random processes can be found in numpy.random
.
Distributions: histogram and probability density function
Given observations of a random process, their histogram is an estimator of the random process's PDF (probability density function):
samples = np.random.normal(size=1000)
bins = np.arange(-4, 5)
bins
array ([-4, -3, -2, -1, 0, 1, 2, 3, 4])
histogram = np.histogram(samples, bins=bins, density=True)[0]
bins = 0.5*(bins[1:] + bins[:-1])
bins
array ([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])
from scipy import stats
pdf = stats.norm.pdf(bins) # norm is a distribution object
plt.plot(bins, histogram)
[<matplotlib.lines.line2D object at ...>]
plt.plot(bins, pdf)
[<matplotlib.lines.line2D object at ...>]
The distribution objects
scipy.stats.norm
is a distribution object: each distribution
in scipy.stats
is represented as an object. Here it's the
normal distribution, and it comes with a PDF, a CDF, and much more.
If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:
loc, std = stats.norm.fit(samples)
loc
-0.045256707...
std
0.9870331586
Exercise: Probability distributions
Generate 1000 random variates from a gamma distribution with a shape parameter of 1, then plot a histogram from those samples. Can you plot the pdf on top (it should match)?
Extra: the distributions have many useful methods. Explore them by
reading the docstring or by using tab completion. Can you recover
the shape parameter 1 by using the fit
method on your random
variates?
Mean, median, and percentiles
The mean is an estimator of the center of the distribution:
np.mean(samples)
-0.0452567074...
The median is another estimator of the center. It is the value with half of the observations below and half above:
np.median(samples)
-0.0580280347...
Unlike the mean, the median is not sensitive to the tails of the distribution. It is "robust".
Exercise: Compare mean and median on samples of a Gamma distribution
Which one seems to be the best estimator of the center for the Gamma distribution?
The median is also the percentile 50 because 50% of the observation are below it:
stats.scoreatpercentile(samples, 50)
-0.0580280347...
Similarly, we can calculate the percentile 90:
stats.scoreatpercentile(samples, 90)
1.2315935511...
The percentile is an estimator of the CDF: cumulative distribution function.
Statistical tests
A statistical test is a decision indicator. For instance, if we have two
sets of observations that we assume are generated from Gaussian
processes, we can use a
T-test to decide
whether the means of two sets of observations are significantly different:
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)
(array(-3.177574054...), 0.0019370639...)
The resulting output is composed of:
- The T statistic value: it is a number, the sign of which is proportional to the difference between the two random processes, and the magnitude is related to the significance of this difference.
- the p value: the probability of both processes being identical. If it is close to 1, the two processes are almost certainly identical. The closer it is to zero, the more likely it is that the processes have different means.
Source: Gaël Varoquaux, Adrien Chauve, Andre Espaze, Emmanuelle Gouillart, and Ralf Gommers, http://scipy-lectures.org/intro/scipy.html#statistics-and-random-numbers-scipy-stats
This work is licensed under a Creative Commons Attribution 4.0 License.