# Distribution Fitting

This package provides methods to fit a distribution to a given set of samples. Generally, one may write

``d = fit(D, x)``

This statement fits a distribution of type `D` to a dataset `x`, where `x` is an array containing all samples. The `fit` function chooses a reasonable way to fit the distribution, which in most cases is maximum likelihood estimation.
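
As a minimal sketch of this call, assuming the Distributions package is installed, the following draws synthetic samples from a known normal distribution and fits a `Normal` to them. The seed, sample size, and true parameters are arbitrary values chosen for illustration.

```julia
using Distributions
using Random

Random.seed!(42)                      # arbitrary seed, only for reproducibility

# Synthetic data: 10,000 draws from Normal(μ=2.0, σ=0.5) (illustrative values).
x = rand(Normal(2.0, 0.5), 10_000)

# Fit a Normal to the samples.
d = fit(Normal, x)

# The estimated parameters should be close to the true ones.
println(params(d))
```

With this many samples, the fitted mean and standard deviation should land near 2.0 and 0.5, though the exact values depend on the random draws.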

Note

The first argument can be either the distribution name, such as `Binomial`, or a concrete distribution with a type parameter, such as `Normal{Float64}` or `Exponential{Float32}`. In the latter case, however, the type parameter of the distribution is ignored:

```julia
julia> fit(Cauchy{Float32}, collect(-4:4))
Cauchy{Float64}(μ=0.0, σ=2.0)
```

## Maximum Likelihood Estimation

The `fit_mle` function performs maximum likelihood estimation.

### Synopsis

``fit_mle(D, x)``

Fit a distribution of type `D` to a given data set `x`.

• For univariate distributions, `x` can be an array of arbitrary size.
• For multivariate distributions, `x` should be a matrix in which each column is a sample.

``fit_mle(D, x, w)``

Fit a distribution of type `D` to a weighted data set `x`, with weights given by `w`.

Here, `w` should be an array with length `n`, where `n` is the number of samples contained in `x`.
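
The weighted form can be sketched as follows, assuming the Distributions package is installed. The data and weights are made-up values. The weights are given as `Float64` here, which is an assumption about what the weighted methods accept most reliably.

```julia
using Distributions

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 3.0]   # one weight per sample; the last sample counts three times

# Weighted maximum likelihood fit.
d_weighted = fit_mle(Normal, x, w)

# With integer-valued weights, the result should match fitting the data set
# in which each sample is repeated according to its weight.
d_repeated = fit_mle(Normal, [1.0, 2.0, 3.0, 4.0, 4.0, 4.0])

println(params(d_weighted))
println(params(d_repeated))
```

Comparing against the repeated-sample fit is a simple way to check that the weights are being interpreted as intended.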

### Applicable distributions

The `fit_mle` method has been implemented for the following distributions:

Univariate:

Multivariate:

For most of these distributions, the usage is as described above. A few special distributions require additional information for estimation and use a modified interface:

```julia
fit_mle(Binomial, n, x)        # n is the number of trials in each experiment
fit_mle(Binomial, n, x, w)

fit_mle(Categorical, k, x)     # k is the space size (i.e. the number of distinct values)
fit_mle(Categorical, k, x, w)

fit_mle(Categorical, x)        # equivalent to fit_mle(Categorical, maximum(x), x)
fit_mle(Categorical, x, w)
```
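
As an illustration of these modified signatures, here is a small sketch with made-up data, assuming the Distributions package is installed. The variable names are hypothetical.

```julia
using Distributions

# Binomial: each entry is the number of successes out of n = 10 trials.
n = 10
successes = [3, 5, 4, 6, 2]
d_bin = fit_mle(Binomial, n, successes)
println(ntrials(d_bin), " ", succprob(d_bin))

# Categorical: observations are category indices in 1:k.
obs = [1, 2, 2, 3, 3, 3]
d_cat4 = fit_mle(Categorical, 4, obs)   # k = 4 given explicitly; category 4 is never observed
d_cat3 = fit_mle(Categorical, obs)      # k taken from the largest observed value
println(probs(d_cat4))
println(probs(d_cat3))
```

Passing `k` explicitly matters when some categories may be absent from the sample: the fitted distribution then still covers all `k` values, assigning zero probability to the unobserved ones.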

## Sufficient Statistics

For many distributions, estimation can be based on the (sum of) sufficient statistics computed from a dataset. To simplify implementation, such distributions implement a `suffstats` method instead of implementing `fit_mle` directly:

```julia
ss = suffstats(D, x)        # ss captures the sufficient statistics of x
ss = suffstats(D, x, w)     # ss captures the sufficient statistics of a weighted dataset

d = fit_mle(D, ss)          # maximum likelihood estimation based on sufficient stats
```

When `fit_mle` is invoked on `D`, a fallback `fit_mle` method first calls `suffstats` to compute the sufficient statistics, and then calls a `fit_mle` method on those statistics to get the result. For some distributions this approach is not the most efficient, so the `fit_mle` method is specialized to implement a more efficient estimation algorithm.
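
The two-step path can be sketched for `Normal` as follows, assuming the Distributions package is installed. The data values are made up, and the comparison at the end only checks that both routes agree.

```julia
using Distributions

x = [0.5, 1.5, 2.0, 4.0]

ss = suffstats(Normal, x)         # sufficient statistics of x
d_from_ss = fit_mle(Normal, ss)   # estimate from the statistics alone
d_direct  = fit_mle(Normal, x)    # estimate from the raw data

println(params(d_from_ss))
println(params(d_direct))
```

Keeping the statistics object around can be useful when the same summary is reused, since the second step no longer needs the raw samples.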

## Maximum-a-Posteriori Estimation

This package also supports maximum-a-posteriori (MAP) estimation, which is implemented as part of the conjugate exponential family framework (see Conjugate Prior and Posterior).