In any module that deals with statistics, one basic skill you must have is the ability to program and plot the probability distributions typically encountered in data science. This tutorial revisits the distributions introduced in this section, now expressed using the scipy.stats module.
Discrete distributions
Categorical distribution
- Story. A probability is assigned to each of a set of discrete outcomes.
- Example. A hen will peck at grain A with probability $\theta_A$, grain B with probability $\theta_B$, and grain C with probability $\theta_C$.
- Parameters. The distribution is parametrized by the probabilities assigned to each event. We define $\theta_y$ to be the probability assigned to outcome $y$. The set of $\theta_y$'s are the parameters, and are constrained by $\sum_y \theta_y = 1$.
- Support. If we index the categories with sequential integers from 1 to $N$, the distribution is supported for integers 1 to $N$, inclusive.
- Probability mass function. $f(y; \theta) = \theta_y$.
- Usage (with theta length $N$)

  | Package | Syntax |
  |---|---|
  | NumPy | np.random.choice(len(theta), p=theta) |
  | SciPy | scipy.stats.rv_discrete(values=(range(len(theta)), theta)).rvs() |
  | Stan | categorical(theta) |
- Related distributions.
- Notes.
  - If you are using the scipy.stats module, this distribution must be constructed manually with scipy.stats.rv_discrete(). The categories need to be encoded by an index. For the interactive plot below, we need to specify a custom PMF and CDF.
  - To sample out of a Categorical distribution, use numpy.random.choice(), specifying the values of $\theta$ using the p kwarg.
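The two notes above can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code: the three probabilities in `theta` and the seed are made-up values, the categories are encoded as integers 1 to N to match the stated support, and sampling uses NumPy's newer `Generator.choice` in place of `numpy.random.choice()` (both accept the `p` kwarg).

```python
import numpy as np
import scipy.stats

# Illustrative probabilities (an assumption for this sketch); they must sum to 1.
theta = np.array([0.2, 0.3, 0.5])

# Encode the categories by an integer index, 1 to N inclusive.
categories = np.arange(1, len(theta) + 1)

# Construct the Categorical distribution manually with rv_discrete.
categorical = scipy.stats.rv_discrete(name="categorical", values=(categories, theta))

# PMF and CDF evaluated at each category index.
pmf_values = categorical.pmf(categories)
cdf_values = categorical.cdf(categories)

# Draw samples with NumPy, passing the probabilities through the p kwarg.
rng = np.random.default_rng(seed=0)
samples = rng.choice(categories, size=10_000, p=theta)

# Empirical frequency of each category; these should be close to theta.
freqs = np.array([(samples == c).mean() for c in categories])

print("PMF:", pmf_values)
print("CDF:", cdf_values)
print("Empirical frequencies:", freqs)
```

Passing `categories` rather than `len(theta)` as the first argument to `choice` keeps the samples on the 1 to N index; with `len(theta)` the draws would instead run from 0 to N-1.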
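import numpy as np
import bokeh.io

# distribution_plot_app and notebook_url are not defined in this snippet;
# they are assumed to have been set up earlier in the notebook.

def categorical_pmf(x, θ1, θ2, θ3):
    """PMF for four categories; x is an array of integer categories 1 to 4."""
    # The fourth probability is whatever remains after the first three.
    thetas = np.array([θ1, θ2, θ3, 1-θ1-θ2-θ3])
    # Any negative probability means the parameters are invalid: return NaNs.
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    # Categories are indexed from 1, array entries from 0.
    return thetas[x-1]

def categorical_cdf_indiv(x, thetas):
    """CDF at a single point x, given the full array of probabilities."""
    if x < 1:
        return 0
    elif x >= 4:
        return 1
    else:
        # Sum the probabilities of all categories up to and including floor(x).
        return np.sum(thetas[:int(x)])

def categorical_cdf(x, θ1, θ2, θ3):
    """CDF for four categories, evaluated at each value in x."""
    thetas = np.array([θ1, θ2, θ3, 1-θ1-θ2-θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return np.array([categorical_cdf_indiv(x_val, thetas) for x_val in x])

# Slider settings for the three free parameters.
params = [dict(name='θ1', start=0, end=1, value=0.2, step=0.01),
          dict(name='θ2', start=0, end=1, value=0.3, step=0.01),
          dict(name='θ3', start=0, end=1, value=0.1, step=0.01)]

app = distribution_plot_app(x_min=1,
                            x_max=4,
                            custom_pmf=categorical_pmf,
                            custom_cdf=categorical_cdf,
                            params=params,
                            x_axis_label='category',
                            title='Discrete categorical')
bokeh.io.show(app, notebook_url=notebook_url)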
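Because the interactive app only shows the curves, it can help to inspect the numbers behind them directly. The sketch below is one possible check, under the same convention as the code above (three free parameters, the fourth probability taken as the remainder). The helper name `categorical_pmf_cdf` is hypothetical and is not part of the tutorial or of any library.

```python
import numpy as np

def categorical_pmf_cdf(theta1, theta2, theta3):
    """Return (pmf, cdf) over categories 1 to 4 (hypothetical helper).

    The fourth probability is the remainder 1 - theta1 - theta2 - theta3.
    If any probability is negative, both arrays are filled with NaN.
    """
    thetas = np.array([theta1, theta2, theta3, 1 - theta1 - theta2 - theta3])
    if (thetas < 0).any():
        return np.full(4, np.nan), np.full(4, np.nan)
    # The CDF at category k is the running sum of the first k probabilities.
    return thetas, np.cumsum(thetas)

# Same starting values as the sliders: θ1 = 0.2, θ2 = 0.3, θ3 = 0.1.
pmf, cdf = categorical_pmf_cdf(0.2, 0.3, 0.1)
for category, (p, c) in enumerate(zip(pmf, cdf), start=1):
    print(f"category {category}: pmf = {p:.2f}, cdf = {c:.2f}")

# Parameters that sum to more than 1 leave a negative remainder.
bad_pmf, bad_cdf = categorical_pmf_cdf(0.6, 0.5, 0.1)
```

Using `np.cumsum` replaces the per-point loop with a single vectorized call, at the cost of evaluating the CDF only at the integer categories rather than at arbitrary x.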