In any module that deals with statistics, one basic skill you need is the ability to program and plot the probability distributions commonly encountered in data science. This tutorial revisits several distributions introduced in this section, now expressed using the scipy.stats module.
Discrete distributions
Hypergeometric distribution
Story. Consider an urn with \(a\) white balls and \(b\) black balls. Draw \(N\) balls from this urn without replacement. The number of white balls drawn, \(n\), is Hypergeometrically distributed.
- Example. There are \(a+b\) finches on an island, and \(a\) of them are tagged (and therefore \(b\) of them are untagged). You capture \(N\) finches. The number of tagged finches \(n\) is Hypergeometrically distributed, \(f(n;N,a,b)\), as defined below.
- Parameters. There are three parameters: the number of draws \(N\), the number of white balls \(a\), and the number of black balls \(b\).
- Support. The Hypergeometric distribution is supported on the set of integers between \(\max(0, N-b)\) and \(\min(N, a)\), inclusive.
- Probability mass function.
\(\begin{align}
f(n;N, a, b) = \frac{\begin{pmatrix}a\\n\end{pmatrix}\begin{pmatrix}b\\N-n\end{pmatrix}}
{\begin{pmatrix}a+b\\N\end{pmatrix}}
\end{align}\).
- Usage
| Package | Syntax |
| --- | --- |
| NumPy | np.random.hypergeometric(a, b, N) |
| SciPy | scipy.stats.hypergeom(a+b, a, N) |
| Stan | hypergeometric(N, a, b) |
- Related distributions.
- In the limit of \(a+b→∞\) such that \(a/(a+b)\) is fixed, we get a Binomial distribution with \(N\) trials and success probability \(θ=a/(a+b)\).
- Notes.
- This distribution is analogous to the Binomial distribution, except that the Binomial distribution describes draws from an urn with replacement. In the analogy, the Binomial success probability is \(θ=a/(a+b)\).
- SciPy uses a different parametrization. Let \(M=a+b\) be the total number of balls in the urn. Then, noting the order of the parameters, since this is what scipy.stats.hypergeom expects,
\(\begin{align}
f(n;M, a, N) = \frac{\begin{pmatrix}a\\n\end{pmatrix}\begin{pmatrix}M-a\\N-n\end{pmatrix}}
{\begin{pmatrix}M\\N\end{pmatrix}}.
\end{align}\)
- The random number generator in numpy.random has a different parametrization than the scipy.stats module. The numpy.random.hypergeometric() function uses the same parametrization as Stan, except the parameters are given in the order a, b, N, not N, a, b, as in Stan.
- When using the sliders below, you will only get a plot if \(N ≤ a+b\) because you cannot draw more balls out of the urn than are actually in there.
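The parameter orders described in the notes above are easy to mix up, so a quick numerical check can help. The following is a minimal sketch, not part of the original tutorial: the helper name hypergeom_pmf and the values of N, a, and b are chosen purely for illustration. It evaluates the probability mass function directly from the binomial coefficients, compares it with scipy.stats.hypergeom called in the order (a+b, a, N), and draws samples with NumPy in the order (a, b, N).

```python
from math import comb

import numpy as np
import scipy.stats as st


def hypergeom_pmf(n, N, a, b):
    """PMF f(n; N, a, b) written directly from the binomial-coefficient formula.

    This helper is hypothetical and exists only for this illustration.
    """
    # Outside the support the probability is zero.
    if n < max(0, N - b) or n > min(N, a):
        return 0.0
    return comb(a, n) * comb(b, N - n) / comb(a + b, N)


# Illustrative values (assumptions, not from the tutorial):
# 10 draws from an urn with 7 white and 12 black balls.
N, a, b = 10, 7, 12

# SciPy's order is (M, a, N), where M = a + b is the total number of balls.
scipy_dist = st.hypergeom(a + b, a, N)

# Integers from max(0, N - b) to min(N, a), inclusive.
support = np.arange(max(0, N - b), min(N, a) + 1)
pmf_direct = np.array([hypergeom_pmf(int(n), N, a, b) for n in support])
pmf_scipy = scipy_dist.pmf(support)

# The two parametrizations should describe the same distribution.
print("PMFs agree:", np.allclose(pmf_direct, pmf_scipy))

# NumPy's order is (a, b, N): white balls, black balls, number of draws.
rng = np.random.default_rng(seed=0)
samples = rng.hypergeometric(a, b, N, size=1000)
print("Sample range:", samples.min(), "to", samples.max())
```

Here the samples come from a NumPy Generator created with np.random.default_rng; its hypergeometric method takes its arguments in the same a, b, N order as np.random.hypergeometric.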
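import bokeh.io
import scipy.stats as st

# distribution_plot_app and notebook_url are assumed to be defined earlier in the tutorial.
params = [dict(name='N', start=1, end=20, value=10, step=1),
          dict(name='a', start=1, end=20, value=10, step=1),
          dict(name='b', start=1, end=20, value=10, step=1)]
app = distribution_plot_app(x_min=0,
                            x_max=40,
                            scipy_dist=st.hypergeom,
                            params=params,
                            # Map the sliders (N, a, b) to SciPy's order (M, a, N), with M = a + b.
                            transform=lambda N, a, b: (a+b, a, N),
                            x_axis_label='n',
                            title='Hypergeometric')
bokeh.io.show(app, notebook_url=notebook_url)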
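The Binomial limit mentioned under related distributions can also be seen numerically. The sketch below is an illustration under assumed values (N = 10 draws, a fixed white-ball fraction of 0.3, and a handful of urn sizes M = a+b, none of which come from the tutorial). It compares the Hypergeometric probability mass function with the Binomial one as the urn grows while a/(a+b) stays fixed.

```python
import numpy as np
import scipy.stats as st

# Illustrative values (assumptions, not from the tutorial).
N = 10        # number of draws
theta = 0.3   # fixed fraction of white balls, a / (a + b)
n = np.arange(N + 1)

# Binomial PMF with N trials and success probability theta.
binom_pmf = st.binom.pmf(n, N, theta)

# Largest pointwise gap between the two PMFs for each urn size M = a + b.
max_abs_diff = {}
for M in (20, 200, 2000, 20000):
    a = int(round(theta * M))
    # SciPy's order is (M, a, N).
    hyper_pmf = st.hypergeom.pmf(n, M, a, N)
    max_abs_diff[M] = float(np.max(np.abs(hyper_pmf - binom_pmf)))

for M, diff in max_abs_diff.items():
    print(f"M = {M:>6d}: max |hypergeom - binom| = {diff:.2e}")
```

The gap shrinks as M grows, because removing a few balls from a very large urn barely changes the fraction of white balls left, so drawing without replacement behaves almost like drawing with replacement.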