Probability Distributions and their Stories

Any work involving statistics requires one basic skill: being able to compute with and plot the probability distributions typically encountered in data science. This tutorial revisits the distributions introduced in this section, now phrased in terms of the scipy.stats module.

Discrete multivariate distributions

So far, we have looked at univariate distributions, but we will consider multivariate distributions in class, and you will encounter them in your research. First, we consider a discrete multivariate distribution, the Multinomial.


Multinomial distribution

  • Story. This is a generalization of the Binomial distribution. Instead of a Bernoulli trial consisting of two outcomes, each trial has K outcomes. The probability of getting y_1 of outcome 1, y_2 of outcome 2, ..., and y_K of outcome K out of a total of N trials is Multinomially distributed.
  • Example. There are two alleles in a population, A and a. Each individual may have genotype AA, Aa, or aa. The probability distribution describing having y_1 AA individuals, y_2 Aa individuals, and y_3 aa individuals in a population of N total individuals is Multinomially distributed.
  • Parameters. N, the total number of trials, and θ={θ_1,θ_2,…,θ_K}, the probabilities of each outcome. Note that ∑_i θ_i=1 and there is a further restriction that ∑_i y_i=N.
  • Support. The K-nomial distribution is supported on the vectors in \mathbb{N}^K (nonnegative integer entries) whose entries sum to N.
  • Usage. The usage below assumes that theta is a length K array.

    Package        Syntax
    NumPy          np.random.multinomial(N, theta)
    SciPy          scipy.stats.multinomial(N, theta)
    Stan sampling  multinomial(theta)
    Stan rng       multinomial_rng(theta, N)

  • Probability mass function.

    \begin{align}f(\mathbf{y};\mathbf{\theta}, N) = \frac{N!}{y_1!\,y_2!\cdots y_K!}\,\theta_1^{y_1}\,\theta_2^{y_2}\cdots \theta_K^{y_K}\end{align}

  • Related distributions.
    • The Multinomial distribution generalizes the Binomial distribution to multiple dimensions.
  • Notes.
    • For a sampling statement in Stan, the value of N is implied by the data, since N = ∑_i y_i.
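
To make the Multinomial entries above concrete, here is a minimal Python sketch that draws samples, evaluates the probability of one outcome with scipy.stats, checks it against the formula, and confirms that the K = 2 case matches the Binomial. The genotype probabilities theta, the population size N, the seed, and the outcome y are made-up values chosen only for illustration. The sketch uses rng.multinomial from NumPy's Generator interface, which takes the same (N, theta) arguments as the np.random.multinomial call listed under Usage.

```python
import math

import numpy as np
import scipy.stats as st

# Illustrative (made-up) probabilities for genotypes AA, Aa, aa; must sum to 1
theta = np.array([0.25, 0.5, 0.25])

# Illustrative total number of individuals (trials)
N = 20

# Draw 5 samples; each row is a vector (y_1, y_2, y_3) whose entries sum to N
rng = np.random.default_rng(seed=3252)
samples = rng.multinomial(N, theta, size=5)

# Probability of one specific outcome, computed by scipy.stats
y = np.array([5, 10, 5])
pmf_scipy = st.multinomial.pmf(y, n=N, p=theta)

# The same probability computed directly from the formula:
# N! / (y_1! y_2! ... y_K!) * theta_1^y_1 * ... * theta_K^y_K
coeff = math.factorial(N) / math.prod(math.factorial(int(y_i)) for y_i in y)
pmf_manual = coeff * np.prod(theta ** y)

# With K = 2 outcomes, the Multinomial reduces to the Binomial
pmf_k2 = st.multinomial.pmf([7, 13], n=N, p=[0.3, 0.7])
pmf_binom = st.binom.pmf(7, N, 0.3)

print(samples)
print(pmf_scipy, pmf_manual)
print(pmf_k2, pmf_binom)
```

The two values on each of the last two printed lines should agree to within floating-point error, and every row of samples should sum to N, reflecting the restriction ∑_i y_i=N.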