Probability Distributions and their Stories

In any module that deals with statistics, one basic skill you need is the ability to compute with and plot the probability distributions commonly encountered in data science. This tutorial revisits the distributions introduced in this section, now expressed using the scipy.stats module.

Discrete distributions

Categorical distribution

  • Story. A probability is assigned to each of a set of discrete outcomes.
  • Example. A hen will peck at grain A with probability θ_A, grain B with probability θ_B, and grain C with probability θ_C.
  • Parameters. The distribution is parametrized by the probabilities assigned to each outcome. We define θ_y to be the probability assigned to outcome y. The θ_y's are the parameters; each lies between zero and one, and together they are constrained by

    \begin{align}\sum_y \theta_y = 1\end{align}.

  • Support. If we index the categories with sequential integers from 1 to N, the distribution is supported for integers 1 to N, inclusive.
  • Probability mass function.

    \begin{align}f(y;\{\theta_y\}) = \theta_y\end{align}.

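  • Usage (with theta an array of N probabilities). The NumPy and SciPy calls below return zero-based indices 0 to N−1, whereas Stan's categorical is supported on 1 to N.

    Package   Syntax
    NumPy     np.random.choice(len(theta), p=theta)
    SciPy     scipy.stats.rv_discrete(values=(range(len(theta)), theta)).rvs()
    Stan      categorical(theta)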

  • Related distributions.
    • The Discrete Uniform distribution is a special case where all θ_y are equal.
    • The Bernoulli distribution is a special case where there are two categories that can be encoded as having outcomes of zero or one. In this case, the parameter for the Bernoulli distribution is the probability of outcome one, θ=θ_1=1−θ_0.
  • Notes.
    • If you are using the scipy.stats module, this distribution must be constructed manually with scipy.stats.rv_discrete(). The categories need to be encoded by an integer index. For the interactive plot below, we need to specify a custom PMF and CDF.
    • To sample from a Categorical distribution, use numpy.random.choice(), specifying the values of θ with the p kwarg.
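
The usage and notes above can be tried out in a few lines. This is a minimal sketch, not the only way to do it: the values in theta, the seeds, and the variable names are illustrative assumptions, and it uses NumPy's newer Generator interface (np.random.default_rng), whose choice() method takes the same p argument as numpy.random.choice().

```python
import numpy as np
import scipy.stats

# Hypothetical probabilities for three categories; they must sum to one.
theta = np.array([0.2, 0.3, 0.5])

# Seeded generator so the draws are reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=3252)

# NumPy: draw zero-based category indices 0, 1, 2 with probabilities theta.
samples = rng.choice(len(theta), p=theta, size=10000)

# Fraction of draws that landed in each category.
freqs = np.bincount(samples, minlength=len(theta)) / len(samples)

# SciPy: build the distribution by hand. Here the categories are
# encoded as 1, 2, 3 to match the support described above.
categories = np.arange(1, len(theta) + 1)
dist = scipy.stats.rv_discrete(values=(categories, theta))

pmf = dist.pmf(categories)   # recovers theta
cdf = dist.cdf(categories)   # running sum of theta
scipy_samples = dist.rvs(size=5, random_state=3252)
```

With this many draws, freqs should land close to theta, which is a quick sanity check that the probabilities were passed in the intended order.
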
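
import numpy as np
import bokeh.io


def categorical_pmf(x, θ1, θ2, θ3):
    """PMF for four categories indexed 1 to 4; x is an array of integer indices."""
    # The fourth probability is fixed by normalization.
    thetas = np.array([θ1, θ2, θ3, 1 - θ1 - θ2 - θ3])
    # A negative entry means the parameters are invalid, so return NaNs.
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return thetas[x-1]


def categorical_cdf_indiv(x, thetas):
    """CDF at a single value x."""
    if x < 1:
        return 0
    elif x >= 4:
        return 1
    else:
        # Sum the probabilities of all categories at or below x.
        return np.sum(thetas[:int(x)])


def categorical_cdf(x, θ1, θ2, θ3):
    """CDF evaluated at each entry of the array x."""
    thetas = np.array([θ1, θ2, θ3, 1 - θ1 - θ2 - θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return np.array([categorical_cdf_indiv(x_val, thetas) for x_val in x])


# distribution_plot_app() and notebook_url are assumed to be defined earlier in the notebook.
params = [dict(name='θ1', start=0, end=1, value=0.2, step=0.01),
          dict(name='θ2', start=0, end=1, value=0.3, step=0.01),
          dict(name='θ3', start=0, end=1, value=0.1, step=0.01)]
app = distribution_plot_app(x_min=1,
                            x_max=4,
                            custom_pmf=categorical_pmf,
                            custom_cdf=categorical_cdf,
                            params=params,
                            x_axis_label='category',
                            title='Discrete categorical')
bokeh.io.show(app, notebook_url=notebook_url)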
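
The CDF can also be checked numerically without the interactive app. The sketch below is one possible vectorized alternative based on a cumulative sum. The function name categorical_cdf_cumsum and the test points are hypothetical, and it assumes four categories indexed 1 to 4, with the fourth probability being whatever remains after the first three.

```python
import numpy as np


def categorical_cdf_cumsum(x, thetas):
    """CDF of a categorical distribution on categories 1..len(thetas)."""
    x = np.asarray(x)
    # Pad with a leading 0 so that values below 1 map to a CDF of 0.
    cum = np.concatenate(([0.0], np.cumsum(thetas)))
    # floor(x), clipped to [0, N], counts the categories at or below x.
    idx = np.clip(np.floor(x), 0, len(thetas)).astype(int)
    return cum[idx]


# Same starting values as the sliders; the last entry follows from normalization.
theta1, theta2, theta3 = 0.2, 0.3, 0.1
thetas = np.array([theta1, theta2, theta3, 1 - theta1 - theta2 - theta3])

# Points below, inside, and above the support, including a non-integer.
x = np.array([0.5, 1, 2.7, 3, 4, 10])
cdf_vals = categorical_cdf_cumsum(x, thetas)
```

The CDF is a step function, so a non-integer such as 2.7 takes the value at category 2, and anything at or above the last category returns one up to floating-point rounding.
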