Probability Distributions and their Stories

A basic skill in any work involving statistics is being able to compute with, and plot, the probability distributions commonly encountered in data science. This tutorial revisits the distributions introduced in this section, now expressed using the scipy.stats module.

Continuous distributions

Gaussian, a.k.a. Normal, distribution

  • Story. Any quantity that emerges as the sum of a large number of subprocesses tends to be Gaussian distributed provided none of the subprocesses is very broadly distributed.
  • Example. We measure the length of many C. elegans eggs. The lengths are Gaussian distributed. Many biological measurements, like the height of people, are (approximately) Gaussian distributed. Many processes contribute to setting the length of an egg or the height of a person.
  • Parameters. The Gaussian distribution has two parameters, the mean μ, which determines the location of its peak, and the standard deviation σ, which is strictly positive (the σ→0 limit defines a Dirac delta function) and determines the width of the peak.
  • Support. The Gaussian distribution is supported on the set of real numbers.
  • Probability density function.

    \begin{align}f(y;\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(y-\mu)^2/2\sigma^2}\end{align}
  • Usage

    Package   Syntax
    NumPy     np.random.normal(mu, sigma)
    SciPy     scipy.stats.norm(mu, sigma)
    Stan      normal(mu, sigma)

  • Related distributions. The Gaussian distribution is a limiting distribution in the sense of the central limit theorem, and it is also the limit of many other distributions. This can be seen by formally taking limits of, e.g., the Gamma, Student-t, and Binomial distributions, which allows direct comparison of their parameters with μ and σ.
  • Notes.
    • SciPy, NumPy, and Stan all refer to the Gaussian distribution as the Normal distribution.
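As a quick check of the syntax listed under Usage, the following minimal sketch evaluates the probability density function directly from the formula and compares it with scipy.stats.norm, then draws samples with NumPy. The values of mu and sigma, the helper name gaussian_pdf, and the seed are illustrative assumptions and not part of the original tutorial. The sketch uses NumPy's Generator interface (rng.normal), which takes the same (mu, sigma) arguments as np.random.normal.

```python
import numpy as np
import scipy.stats

# Illustrative parameter values (assumed for this sketch)
mu, sigma = 0.0, 0.2


def gaussian_pdf(y, mu, sigma):
    """Evaluate the Gaussian PDF directly from its formula."""
    return np.exp(-((y - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)


# Evaluate the PDF on a grid, by hand and with SciPy
y = np.linspace(-2, 2, 401)
pdf_formula = gaussian_pdf(y, mu, sigma)
pdf_scipy = scipy.stats.norm(mu, sigma).pdf(y)  # loc=mu, scale=sigma

# The two evaluations should agree to floating-point precision
max_abs_diff = np.max(np.abs(pdf_formula - pdf_scipy))

# Draw samples and compare their moments with the parameters
rng = np.random.default_rng(seed=3252)
samples = rng.normal(mu, sigma, size=100_000)
sample_mean = samples.mean()
sample_std = samples.std(ddof=1)

print(f"max |formula - scipy| = {max_abs_diff:.2e}")
print(f"sample mean = {sample_mean:.4f}, sample std = {sample_std:.4f}")
```

Note that scipy.stats.norm takes the mean and standard deviation as its loc and scale arguments, so scipy.stats.norm(mu, sigma) matches the parametrization used above.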
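# Assumes distribution_plot_app and notebook_url are defined earlier in the notebook.
import bokeh.io
import scipy.stats as st

# Slider settings for the mean and the standard deviation
params = [dict(name='µ', start=-0.5, end=0.5, value=0, step=0.01),
          dict(name='σ', start=0.1, end=1.0, value=0.2, step=0.01)]
app = distribution_plot_app(x_min=-2,
                            x_max=2,
                            scipy_dist=st.norm,
                            params=params,
                            x_axis_label='y',
                            title='Gaussian, a.k.a. Normal')
bokeh.io.show(app, notebook_url=notebook_url)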
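
The story of the Gaussian distribution can also be checked numerically. The sketch below is an illustration under assumed settings, not part of the original tutorial: each simulated quantity is the sum of 50 independent Uniform(0, 1) subprocesses, none of which is broadly distributed. The sums are compared with a Gaussian distribution whose mean and standard deviation are computed from the uniform distribution's mean (1/2) and variance (1/12). The number of subprocesses, the sample count, and the seed are arbitrary choices.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=0)

# Assumed settings for this illustration
n_subprocesses = 50
n_samples = 20_000

# Each row holds the subprocesses contributing to one measured quantity;
# summing along the row gives the quantity itself.
totals = rng.uniform(0, 1, size=(n_samples, n_subprocesses)).sum(axis=1)

# Mean and standard deviation of the limiting Gaussian distribution
mu_pred = n_subprocesses * 0.5
sigma_pred = np.sqrt(n_subprocesses / 12)

# Kolmogorov-Smirnov distance between the empirical CDF of the sums
# and the CDF of the predicted Gaussian distribution
ks_stat = scipy.stats.kstest(
    totals, scipy.stats.norm(mu_pred, sigma_pred).cdf
).statistic

print(f"predicted mean = {mu_pred:.3f}, sample mean = {totals.mean():.3f}")
print(f"predicted std  = {sigma_pred:.3f}, sample std  = {totals.std():.3f}")
print(f"KS distance    = {ks_stat:.4f}")
```

A small Kolmogorov-Smirnov distance indicates that the distribution of the sums is close to Gaussian. Replacing the uniform draws with a heavy-tailed distribution would be one way to explore the "not very broadly distributed" condition in the story.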