Discrete distributions

Categorical distribution

  • Story. A probability is assigned to each of a set of discrete outcomes.
  • Example. A hen will peck at grain A with probability \(θ_A\), grain B with probability \(θ_B\), and grain C with probability \(θ_C\).
  • Parameters. The distribution is parametrized by the probabilities assigned to each outcome. We define \(θ_y\) to be the probability assigned to outcome \(y\). The parameters are the set of \(θ_y\) values, each of which is nonnegative, and together they are constrained by

    \(\begin{align}
    \sum_y \theta_y = 1.
    \end{align}\)

  • Support. If we index the categories with sequential integers from 1 to N, the distribution is supported on the integers 1 to N, inclusive.
  • Probability mass function.

    \(\begin{align}
    f(y;\{\theta_y\}) = \theta_y.
    \end{align}\)

  • Usage (with theta an array of length \(N\))

    Package   Syntax
    NumPy     np.random.choice(len(theta), p=theta)
    SciPy     scipy.stats.rv_discrete(values=(range(len(theta)), theta)).rvs()
    Stan      categorical(theta)

  • Related distributions.
    • The Discrete Uniform distribution is a special case where all \(θ_y\) are equal.
    • The Bernoulli distribution is a special case where there are two categories that can be encoded as having outcomes of zero or one. In this case, the parameter for the Bernoulli distribution is the probability of the outcome encoded as one, \(θ=θ_1=1−θ_0\).
  • Notes.
    • To use this distribution with the scipy.stats module, you must construct it manually with scipy.stats.rv_discrete(), encoding the categories by integer index. For the interactive plot below, we need to specify a custom PMF and CDF.
    • To sample from a Categorical distribution, use numpy.random.choice(), passing the values of \(θ\) with the p keyword argument.
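The two notes above can be sketched together as follows. This is a minimal illustration, not the notebook's own code: the probability vector theta, the seed, and the sample size are assumptions chosen for the example, and the categories are given 1-based labels to match the support described above. It draws samples with the choice() method of a NumPy Generator, the newer counterpart of numpy.random.choice().

```python
import numpy as np
import scipy.stats

# Illustrative probabilities for four categories (an assumption for this sketch)
theta = np.array([0.2, 0.3, 0.1, 0.4])
categories = np.arange(1, len(theta) + 1)  # 1-based labels: 1, 2, 3, 4

# The parameters must be nonnegative and sum to one
assert (theta >= 0).all() and np.isclose(theta.sum(), 1.0)

# Sample with NumPy, passing the probabilities with the p keyword argument
rng = np.random.default_rng(3252)
samples = rng.choice(categories, size=10000, p=theta)

# Construct the distribution manually with scipy.stats.rv_discrete
dist = scipy.stats.rv_discrete(values=(categories, theta))
pmf = dist.pmf(categories)  # PMF evaluated at each category
cdf = dist.cdf(categories)  # CDF evaluated at each category

# Fraction of samples landing in each category
freqs = np.array([(samples == k).mean() for k in categories])
```

With a sample this large, freqs should lie close to theta, and pmf should reproduce theta itself, since the PMF of a Categorical distribution is just the probability assigned to each outcome.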
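import numpy as np

def categorical_pmf(x, θ1, θ2, θ3):
    # The fourth probability is fixed by the constraint that all four sum to one
    thetas = np.array([θ1, θ2, θ3, 1 - θ1 - θ2 - θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    # Categories are labeled 1 to 4, so shift to zero-based indexing
    return thetas[x-1]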
def categorical_cdf_indiv(x, thetas):
    if x < 1:
        return 0
    elif x >= 4:
        return 1
    else:
        return np.sum(thetas[:int(x)])
    
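def categorical_cdf(x, θ1, θ2, θ3):
    # The fourth probability is fixed by the constraint that all four sum to one
    thetas = np.array([θ1, θ2, θ3, 1 - θ1 - θ2 - θ3])
    # A negative entry means the chosen values do not define a valid distribution
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return np.array([categorical_cdf_indiv(x_val, thetas) for x_val in x])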
params = [dict(name='θ1', start=0, end=1, value=0.2, step=0.01),
          dict(name='θ2', start=0, end=1, value=0.3, step=0.01),
          dict(name='θ3', start=0, end=1, value=0.1, step=0.01)]
app = distribution_plot_app(x_min=1,
                            x_max=4,
                            custom_pmf=categorical_pmf,
                            custom_cdf=categorical_cdf,
                            params=params,
                            x_axis_label='category',
                            title='Discrete categorical')
bokeh.io.show(app, notebook_url=notebook_url)
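As a cross-check on the custom CDF above, the same step function can be computed without a Python loop by taking a cumulative sum of the probabilities. This is only a sketch under stated assumptions: the function name categorical_cdf_vectorized and the test values are hypothetical, and the categories are assumed to be labeled 1 to len(thetas).

```python
import numpy as np

def categorical_cdf_vectorized(x, thetas):
    """CDF of a Categorical distribution on categories 1..len(thetas)."""
    x = np.asarray(x, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    # Prepend 0 so that entry k holds P(Y <= k), with k = 0 giving 0
    cum = np.concatenate(([0.0], np.cumsum(thetas)))
    # floor(x), clipped to [0, N], counts the categories at or below x
    k = np.clip(np.floor(x), 0, len(thetas)).astype(int)
    return cum[k]

# Illustrative probabilities and evaluation points (assumptions for this sketch)
thetas = np.array([0.2, 0.3, 0.1, 0.4])
x = np.array([0.5, 1.0, 2.5, 3.0, 4.0, 7.0])
cdf_vals = categorical_cdf_vectorized(x, thetas)
```

The result is zero below the first category, jumps by \(θ_y\) at each integer \(y\), and stays at one (up to floating-point rounding of the cumulative sum) from the last category onward.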