Probability Distributions and their Stories

Whenever you work with statistics, one basic skill you must have is the ability to compute and plot the probability distributions typically encountered in data science. This tutorial revisits the distributions introduced in this section, now expressed using the scipy.stats module.

Discrete distributions

Hypergeometric distribution

  • Story. Consider an urn with a white balls and b black balls. Draw N balls from this urn without replacement. The number of white balls drawn, n, is Hypergeometrically distributed.
  • Example. There are a+b finches on an island, and a of them are tagged (and therefore b of them are untagged). You capture N finches. The number of tagged finches n is Hypergeometrically distributed, f(n;N,a,b), as defined below.
  • Parameters. There are three parameters: the number of draws N, the number of white balls a, and the number of black balls b.
  • Support. The Hypergeometric distribution is supported on the set of integers between max(0,N−b) and min(N,a), inclusive.
  • Probability mass function.

    \begin{align}f(n;N, a, b) = \frac{\begin{pmatrix}a\\n\end{pmatrix}\begin{pmatrix}b\\N-n\end{pmatrix}}{\begin{pmatrix}a+b\\N\end{pmatrix}}.\end{align}

  • Usage.

    Package   Syntax
    NumPy     np.random.hypergeometric(a, b, N)
    SciPy     scipy.stats.hypergeom(a+b, a, N)
    Stan      hypergeometric(N, a, b)

  • Related distributions.
    • In the limit of a+b→∞ such that a/(a+b) is fixed, we get a Binomial distribution with parameters N=N and θ=a/(a+b).
  • Notes.
    • This distribution is analogous to the Binomial distribution, except that the Binomial distribution describes draws from an urn with replacement. In the analogy, the Binomial success probability is θ=a/(a+b).
    • SciPy uses a different parametrization. Let M=a+b be the total number of balls in the urn. Then, with the parameters written in the order that scipy.stats.hypergeom expects,

    \begin{align}f(n;M, a, N) = \frac{\begin{pmatrix}a\\n\end{pmatrix}\begin{pmatrix}M-a\\N-n\end{pmatrix}}{\begin{pmatrix}M\\N\end{pmatrix}}.\end{align}

    • The random number generator in numpy.random uses a different parametrization than the scipy.stats module. The numpy.random.hypergeometric() function uses the same parametrization as Stan, except that the parameters are given in the order a, b, N, not N, a, b as in Stan.
    • When using the sliders below, you will only get a plot if N ≤ a+b, because you cannot draw more balls out of the urn than it contains.
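
The parametrizations described in the notes above can be checked numerically. The following is a minimal sketch, not part of the original tutorial: the helper name hypergeom_pmf, the values N=10, a=12, b=8, and the seed are illustrative assumptions. It evaluates the PMF directly from binomial coefficients, compares it with scipy.stats.hypergeom called as (a+b, a, N), and draws samples with NumPy's Generator method hypergeometric, which takes its arguments in the same (a, b, N) order as np.random.hypergeometric.

```python
from math import comb

import numpy as np
from scipy import stats


def hypergeom_pmf(n, N, a, b):
    """PMF f(n; N, a, b) written directly from binomial coefficients."""
    # The probability is zero outside the support.
    if n < max(0, N - b) or n > min(N, a):
        return 0.0
    return comb(a, n) * comb(b, N - n) / comb(a + b, N)


# Illustrative values (an assumption, not taken from the tutorial).
N, a, b = 10, 12, 8

# Support: integers from max(0, N - b) to min(N, a), inclusive.
support = np.arange(max(0, N - b), min(N, a) + 1)

# Direct evaluation in the (N, a, b) parametrization.
direct = np.array([hypergeom_pmf(int(n), N, a, b) for n in support])

# SciPy expects (M, a, N) with M = a + b.
from_scipy = stats.hypergeom(a + b, a, N).pmf(support)

print("PMFs agree:", np.allclose(direct, from_scipy))
print("PMF sums to one:", np.isclose(direct.sum(), 1.0))

# NumPy expects (a, b, N): white balls, black balls, number of draws.
rng = np.random.default_rng(3252)
samples = rng.hypergeometric(a, b, N, size=100_000)
print("Sample mean:", samples.mean(), "expected:", N * a / (a + b))
```

If the arguments are passed to either library in the wrong order, the comparison fails or the sample mean drifts away from N·a/(a+b), which makes this a convenient sanity check before plotting.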
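
# distribution_plot_app and notebook_url are not defined in this cell;
# they are assumed to be set up earlier in the tutorial.
import bokeh.io
import scipy.stats as st

# Slider ranges for the three parameters N, a, and b
params = [dict(name='N', start=1, end=20, value=10, step=1),
          dict(name='a', start=1, end=20, value=10, step=1),
          dict(name='b', start=1, end=20, value=10, step=1)]

# scipy.stats.hypergeom expects (M, a, N) with M = a + b, hence the transform
app = distribution_plot_app(x_min=0,
                            x_max=40,
                            scipy_dist=st.hypergeom,
                            params=params,
                            transform=lambda N, a, b: (a+b, a, N),
                            x_axis_label='n',
                            title='Hypergeometric')
bokeh.io.show(app, notebook_url=notebook_url)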
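
The Binomial limit listed under related distributions can also be illustrated numerically. The sketch below is an illustration under assumed values (N=10, θ=0.4, and the three urn sizes are arbitrary choices, not from the tutorial): it holds a/(a+b) fixed, grows a+b, and reports the largest pointwise difference between the Hypergeometric and Binomial PMFs.

```python
import numpy as np
from scipy import stats

# Illustrative values (assumptions): number of draws and fixed ratio a/(a+b).
N, theta = 10, 0.4
n = np.arange(N + 1)

# Binomial PMF with parameters N and theta = a/(a+b).
binom_pmf = stats.binom.pmf(n, N, theta)

max_diffs = []
for scale in (10, 100, 10_000):
    # Keep a/(a+b) = 0.4 while increasing the urn size a + b.
    a, b = 4 * scale, 6 * scale
    # SciPy argument order is (M, a, N) with M = a + b.
    hyper_pmf = stats.hypergeom.pmf(n, a + b, a, N)
    max_diffs.append(np.max(np.abs(hyper_pmf - binom_pmf)))
    print(f"a+b = {a + b:>6d}: max |difference| = {max_diffs[-1]:.2e}")
```

The printed differences should shrink as a+b grows, reflecting that drawing without replacement from a very large urn is nearly the same as drawing with replacement.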