Statistics and Random Numbers

It is time to exercise, reinforce, and apply various topics introduced throughout this unit. Study and practice section 1.6.6 to help review the basics of mixing numpy and scipy.stats for plotting data and running statistical tests. Be sure to import matplotlib so the plotting methods will run without exceptions when they are called.

Statistics and random numbers: scipy.stats

The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random processes can be found in numpy.random.


Distributions: histogram and probability density function

Given observations of a random process, their histogram is an estimator of the random process's PDF (probability density function):

samples = np.random.normal(size=1000)
bins = np.arange(-4, 5)
bins
array ([-4, -3, -2, -1, 0, 1, 2, 3, 4])
histogram = np.histogram(samples, bins=bins, density=True)[0]
bins = 0.5*(bins[1:] + bins[:-1])
bins
array ([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])
from scipy import stats
pdf = stats.norm.pdf(bins)  # norm is a distribution object
plt.plot(bins, histogram) 
[<matplotlib.lines.line2D object at ...>]
plt.plot(bins, pdf) 
[<matplotlib.lines.line2D object at ...>]




The distribution objects

scipy.stats.norm is a distribution object: each distribution in scipy.stats is represented as an object. Here it's the normal distribution, and it comes with a PDF, a CDF, and much more.

If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:

loc, std = stats.norm.fit(samples)
loc     
-0.045256707...
std     
0.9870331586


Exercise: Probability distributions

Generate 1000 random variates from a gamma distribution with a shape parameter of 1, then plot a histogram from those samples. Can you plot the pdf on top (it should match)?

Extra: the distributions have many useful methods. Explore them by reading the docstring or by using tab completion. Can you recover the shape parameter 1 by using the fit method on your random variates?


Mean, median, and percentiles

The mean is an estimator of the center of the distribution:

np.mean(samples)     
-0.0452567074...

The median is another estimator of the center. It is the value with half of the observations below and half above:

np.median(samples)   
-0.0580280347...

Unlike the mean, the median is not sensitive to the tails of the distribution. It is "robust".

Exercise: Compare mean and median on samples of a Gamma distribution

Which one seems to be the best estimator of the center for the Gamma distribution?

The median is also the percentile 50 because 50% of the observation are below it:

stats.scoreatpercentile(samples, 50)     
-0.0580280347...

Similarly, we can calculate the percentile 90:

stats.scoreatpercentile(samples, 90)     
1.2315935511...

The percentile is an estimator of the CDF: cumulative distribution function.


Statistical tests

A statistical test is a decision indicator. For instance, if we have two sets of observations that we assume are generated from Gaussian processes, we can use a T-test to decide whether the means of two sets of observations are significantly different:


a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)  
(array(-3.177574054...), 0.0019370639...)

The resulting output is composed of:

  • The T statistic value: it is a number, the sign of which is proportional to the difference between the two random processes, and the magnitude is related to the significance of this difference.
  • the p value: the probability of both processes being identical. If it is close to 1, the two processes are almost certainly identical. The closer it is to zero, the more likely it is that the processes have different means.

Source: Gaël Varoquaux, Adrien Chauve, Andre Espaze, Emmanuelle Gouillart, and Ralf Gommers, http://scipy-lectures.org/intro/scipy.html#statistics-and-random-numbers-scipy-stats
Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 License.

Last modified: Friday, September 23, 2022, 2:47 PM