Discrete distributions

Categorical distribution

  • Story. A probability is assigned to each of a set of discrete outcomes.
  • Example. A hen will peck at grain A with probability θ_A, grain B with probability θ_B, and grain C with probability θ_C.
  • Parameters. The distribution is parametrized by the probabilities assigned to each outcome. We define θ_y to be the probability assigned to outcome y. The θ_y's are the parameters; each satisfies 0 ≤ θ_y ≤ 1, and together they are constrained by

    \begin{align}\sum_y \theta_y = 1\end{align}.

  • Support. If we index the categories with sequential integers from 1 to N, the distribution is supported for integers 1 to N, inclusive.
  • Probability mass function.

    \begin{align}f(y;\{\theta_y\}) = \theta_y\end{align}.

  • Usage (with theta an array of length N)

    Package   Syntax
    NumPy     np.random.choice(len(theta), p=theta)
    SciPy     scipy.stats.rv_discrete(values=(range(len(theta)), theta)).rvs()
    Stan      categorical(theta)

  • Related distributions.
    • The Discrete Uniform distribution is a special case where all θ_y are equal.
    • The Bernoulli distribution is a special case where there are two categories that can be encoded as having outcomes of zero or one. In this case, the parameter for the Bernoulli distribution is the probability of outcome one, θ=θ_1=1−θ_0.
  • Notes.
    • The scipy.stats module has no built-in Categorical distribution; it must be constructed manually with scipy.stats.rv_discrete(), with the categories encoded by an integer index. For the interactive plot below, we need to specify a custom PMF and CDF.
    • To sample out of a Categorical distribution, use numpy.random.choice(), specifying the values of θ using the p kwarg. Called with an integer N as its first argument, it returns zero-based indices from 0 to N−1.
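The two sampling routes in the notes can be sketched as follows. This is a minimal illustration, not part of the original notebook: the probabilities in theta, the seed, and the sample size are made-up values, and NumPy's Generator API (np.random.default_rng) stands in for the legacy np.random.choice() call. The samples are shifted or indexed so that they match the 1-to-N support described above.

```python
import numpy as np
import scipy.stats

# Hypothetical probabilities for N = 3 categories; they must sum to one.
theta = np.array([0.2, 0.3, 0.5])
n_samples = 10000

rng = np.random.default_rng(seed=3252)

# NumPy draws zero-based indices 0, ..., N-1; add 1 to match the 1-to-N support.
samples_np = rng.choice(len(theta), size=n_samples, p=theta) + 1

# SciPy: build the distribution explicitly from (values, probabilities).
categories = np.arange(1, len(theta) + 1)
cat_dist = scipy.stats.rv_discrete(values=(categories, theta))
samples_sp = np.asarray(cat_dist.rvs(size=n_samples, random_state=3252)).astype(int)

# Empirical frequencies of categories 1, ..., N; these should be close to theta.
freq_np = np.bincount(samples_np, minlength=len(theta) + 1)[1:] / n_samples
freq_sp = np.bincount(samples_sp, minlength=len(theta) + 1)[1:] / n_samples
```

Comparing freq_np and freq_sp against theta is a quick sanity check that the index encoding is what you intended.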
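def categorical_pmf(x, θ1, θ2, θ3):
    """PMF of a four-category Categorical distribution at integer categories x."""
    # The fourth probability is fixed by the constraint that all four sum to one.
    thetas = np.array([θ1, θ2, θ3, 1 - θ1 - θ2 - θ3])
    # Invalid parameter set (some probability negative): return NaNs.
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    # Categories are indexed from 1, array entries from 0.
    return thetas[x-1]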
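def categorical_cdf_indiv(x, thetas):
    """CDF at a single point x, given the full array of probabilities."""
    if x < 1:
        return 0
    elif x >= len(thetas):
        return 1
    else:
        # Sum the probabilities of categories 1 through floor(x).
        return np.sum(thetas[:int(x)])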
    
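def categorical_cdf(x, θ1, θ2, θ3):
    """CDF of a four-category Categorical distribution at each point in x."""
    # The fourth probability is fixed by the constraint that all four sum to one.
    thetas = np.array([θ1, θ2, θ3, 1 - θ1 - θ2 - θ3])
    if (thetas < 0).any():
        return np.array([np.nan]*len(x))
    return np.array([categorical_cdf_indiv(x_val, thetas) for x_val in x])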
params = [dict(name='θ1', start=0, end=1, value=0.2, step=0.01),
          dict(name='θ2', start=0, end=1, value=0.3, step=0.01),
          dict(name='θ3', start=0, end=1, value=0.1, step=0.01)]
app = distribution_plot_app(x_min=1,
                            x_max=4,
                            custom_pmf=categorical_pmf,
                            custom_cdf=categorical_cdf,
                            params=params,
                            x_axis_label='category',
                            title='Discrete categorical')
bokeh.io.show(app, notebook_url=notebook_url)
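The custom CDF above loops over points one at a time. As an alternative sketch, the same step function can be computed for any number of categories with a cumulative sum. The function name categorical_cdf_vec and the test values are hypothetical and not part of the original notebook; the probabilities are the explorer's starting values, with the fourth given by one minus the other three.

```python
import numpy as np

def categorical_cdf_vec(x, thetas):
    """CDF of a Categorical distribution on categories 1, ..., N at points x."""
    x = np.asarray(x, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    # cum[k] = P(Y <= k) for k = 0, ..., N, with cum[0] = 0.
    cum = np.concatenate(([0.0], np.cumsum(thetas)))
    # The CDF is a step function: evaluate at floor(x), clipped to [0, N].
    k = np.clip(np.floor(x), 0, len(thetas)).astype(int)
    return cum[k]

# Starting slider values; the fourth probability is 1 - 0.2 - 0.3 - 0.1.
thetas = np.array([0.2, 0.3, 0.1, 0.4])
x = np.array([0.5, 1, 2.7, 3, 4, 10])
cdf_vals = categorical_cdf_vec(x, thetas)
```

Unlike the looped version, this sketch does not check for negative probabilities, so any validation would need to happen before the call.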