# Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

## Moments

`Statistics.var`

— Function`var(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)`

Compute the variance of a real-valued array `x`, optionally over a dimension `dim`. Observations in `x` are weighted using weight vector `w`. The uncorrected (when `corrected=false`) sample variance is defined as:

$$\frac{1}{\sum{w}} \sum_{i=1}^n w_i \left(x_i - μ\right)^2$$

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when `corrected=true`) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

- `AnalyticWeights`: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
- `FrequencyWeights`: $\frac{1}{\sum{w} - 1}$
- `ProbabilityWeights`: $\frac{n}{(n - 1) \sum w}$, where $n$ equals `count(!iszero, w)`
- `Weights`: `ArgumentError` (bias correction not supported)
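The correction factors above can be checked by hand. The following is a minimal sketch in plain Julia, not the package's implementation: the helper name `weighted_var` and its `kind` keyword are hypothetical and stand in for the weight types.

```julia
# Hypothetical helper: weighted variance using the factors listed above.
# `kind` stands in for the weight type: :none, :analytic, :frequency or :probability.
function weighted_var(x::AbstractVector{<:Real}, w::AbstractVector{<:Real}; kind::Symbol = :none)
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    sw = sum(w)
    μ = sum(w .* x) / sw               # weighted mean
    ss = sum(w .* (x .- μ) .^ 2)       # weighted sum of squared deviations
    factor = if kind === :none
        1 / sw                         # uncorrected
    elseif kind === :analytic
        1 / (sw - sum(abs2, w) / sw)
    elseif kind === :frequency
        1 / (sw - 1)
    elseif kind === :probability
        n = count(!iszero, w)
        n / ((n - 1) * sw)
    else
        throw(ArgumentError("unsupported kind: $kind"))
    end
    return factor * ss
end

x = [1.0, 2.0, 4.0]
w = [2, 1, 1]
v_uncorrected = weighted_var(x, w)                   # 6 / 4
v_frequency = weighted_var(x, w; kind = :frequency)  # 6 / 3
```

With frequency weights, the corrected result equals the ordinary corrected variance of the expanded data `[1, 1, 2, 4]`, which is why that factor is $1 / (\sum w - 1)$.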

`Statistics.std`

— Function`std(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)`

Compute the standard deviation of a real-valued array `x`, optionally over a dimension `dim`. Observations in `x` are weighted using weight vector `w`. The uncorrected (when `corrected=false`) sample standard deviation is defined as:

$$\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n w_i \left(x_i - μ\right)^2}$$

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when `corrected=true`) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

- `AnalyticWeights`: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
- `FrequencyWeights`: $\frac{1}{\sum{w} - 1}$
- `ProbabilityWeights`: $\frac{n}{(n - 1) \sum w}$, where $n$ equals `count(!iszero, w)`
- `Weights`: `ArgumentError` (bias correction not supported)

`StatsBase.mean_and_var`

— Function`mean_and_var(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, var)`

Return the mean and variance of collection `x`. If `x` is an `AbstractArray`, `dim` can be specified as a tuple to compute statistics over these dimensions. A weighting vector `w` can be specified to weight the estimates. Finally, bias correction is applied to the variance calculation if `corrected=true`. See `var` documentation for more details.

`StatsBase.mean_and_std`

— Function`mean_and_std(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, std)`

Return the mean and standard deviation of collection `x`. If `x` is an `AbstractArray`, `dim` can be specified as a tuple to compute statistics over these dimensions. A weighting vector `w` can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if `corrected=true`. See `std` documentation for more details.

`StatsBase.skewness`

— Function`skewness(v, [wv::AbstractWeights], m=mean(v))`

Compute the standardized skewness of a real-valued array `v`, optionally specifying a weighting vector `wv` and a center `m`.

`StatsBase.kurtosis`

— Function`kurtosis(v, [wv::AbstractWeights], m=mean(v))`

Compute the excess kurtosis of a real-valued array `v`, optionally specifying a weighting vector `wv` and a center `m`.

`StatsBase.moment`

— Function`moment(v, k, [wv::AbstractWeights], m=mean(v))`

Return the `k`th order central moment of a real-valued array `v`, optionally specifying a weighting vector `wv` and a center `m`.
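To make the relationship between `moment`, `skewness` and `kurtosis` concrete, here is a rough unweighted sketch. It assumes the population convention (central moments divided by the number of observations); the helper names are hypothetical and are not part of the package.

```julia
# Hypothetical helpers, unweighted case only.
# k-th central moment around center m (divide-by-n convention, an assumption here).
central_moment(v, k, m = sum(v) / length(v)) = sum((v .- m) .^ k) / length(v)

# Standardized skewness: third central moment over the 3/2 power of the second.
std_skewness(v, m = sum(v) / length(v)) =
    central_moment(v, 3, m) / central_moment(v, 2, m)^1.5

# Excess kurtosis: fourth central moment over the squared second, minus 3.
excess_kurtosis(v, m = sum(v) / length(v)) =
    central_moment(v, 4, m) / central_moment(v, 2, m)^2 - 3

v = [1.0, 2.0, 3.0, 4.0, 5.0]
m2 = central_moment(v, 2)   # second central moment
sk = std_skewness(v)        # zero for symmetric data
ku = excess_kurtosis(v)     # negative: flatter than a normal distribution
```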

## Measurements of Variation

`StatsBase.span`

— Function`span(x)`

Return the span of a collection, i.e. the range `minimum(x):maximum(x)`. The minimum and maximum of `x` are computed in one pass using `extrema`.

`StatsBase.variation`

— Function`variation(x, m=mean(x))`

Return the coefficient of variation of collection `x`, optionally specifying a precomputed mean `m`. The coefficient of variation is the ratio of the standard deviation to the mean.

`StatsBase.sem`

— Function`sem(x)`

Return the standard error of the mean of collection `x`, i.e. `sqrt(var(x, corrected=true) / length(x))`.
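Both the coefficient of variation and the standard error of the mean reduce to one-liners over the `Statistics` standard library. The sketch below assumes the corrected (default) standard deviation is the one intended in the ratio.

```julia
using Statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Coefficient of variation: standard deviation divided by the mean.
cv = std(x) / mean(x)

# Standard error of the mean, as given in the description of `sem`.
se = sqrt(var(x; corrected = true) / length(x))
```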

`StatsBase.mad`

— Function`mad(x; center=median(x), normalize=true)`

Compute the median absolute deviation (MAD) of collection `x` around `center` (by default, around the median).

If `normalize` is set to `true`, the MAD is multiplied by `1 / quantile(Normal(), 3/4) ≈ 1.4826`, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.
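As an illustration of the definition, a minimal sketch follows. The helper `mad_sketch` is hypothetical, and the constant 1.4826 is hard-coded as an approximation so that the example does not need a distributions package.

```julia
using Statistics

# Hypothetical helper mirroring the description above.
function mad_sketch(x; center = median(x), normalize = true)
    raw = median(abs.(x .- center))   # median of absolute deviations
    # 1.4826 approximates 1 / quantile(Normal(), 3/4)
    return normalize ? 1.4826 * raw : raw
end

x = [1.0, 2.0, 3.0, 4.0, 100.0]
raw_mad = mad_sketch(x; normalize = false)   # the outlier 100.0 barely matters
scaled_mad = mad_sketch(x)
```

The single outlier dominates the standard deviation of `x` but leaves the MAD essentially unchanged, which is the usual reason for preferring it as a robust scale estimate.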

## Z-scores

`StatsBase.zscore`

— Function`zscore(X, [μ, σ])`

Compute the z-scores of `X`, optionally specifying a precomputed mean `μ` and standard deviation `σ`. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

`μ` and `σ` should be both scalars or both arrays. The computation is broadcasting. In particular, when `μ` and `σ` are arrays, they should have the same size, and `size(μ, i) == 1 || size(μ, i) == size(X, i)` for each dimension.
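The broadcasting rule can be illustrated with ordinary dot syntax. This sketch uses only the `Statistics` standard library and writes out $(x - μ) / σ$ directly rather than calling `zscore`.

```julia
using Statistics

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]

# Scalar μ and σ: one mean and one standard deviation for the whole array.
Z_all = (X .- mean(X)) ./ std(X)

# Array μ and σ of size (1, 2): each column is standardized separately,
# because size(μ, 1) == 1 and size(μ, 2) == size(X, 2).
μ = mean(X; dims = 1)
σ = std(X; dims = 1)
Z_cols = (X .- μ) ./ σ
```

Here both columns of `Z_cols` come out as `-1, 0, 1`, even though the raw columns differ in scale by a factor of ten.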

`StatsBase.zscore!`

— Function`zscore!([Z], X, μ, σ)`

Compute the z-scores of an array `X` with mean `μ` and standard deviation `σ`. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

If a destination array `Z` is provided, the scores are stored in `Z` and it must have the same shape as `X`. Otherwise `X` is overwritten.

## Entropy and Related Functions

`StatsBase.entropy`

— Function`entropy(p, [b])`

Compute the entropy of a collection of probabilities `p`, optionally specifying a real number `b` such that the entropy is scaled by `1/log(b)`. Elements with probability 0 or 1 add 0 to the entropy.

`StatsBase.renyientropy`

— Function`renyientropy(p, α)`

Compute the Rényi (generalized) entropy of order `α` of an array `p`.

`StatsBase.crossentropy`

— Function`crossentropy(p, q, [b])`

Compute the cross entropy between `p` and `q`, optionally specifying a real number `b` such that the result is scaled by `1/log(b)`.

`StatsBase.kldivergence`

— Function`kldivergence(p, q, [b])`

Compute the Kullback-Leibler divergence from `q` to `p`, also called the relative entropy of `p` with respect to `q`, that is, the sum of `pᵢ * log(pᵢ / qᵢ)` over all elements. Optionally a real number `b` can be specified such that the divergence is scaled by `1/log(b)`.
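The entropy, cross entropy and Kullback-Leibler divergence described in this section are linked by a simple identity: the cross entropy equals the entropy plus the divergence. A minimal sketch, with hypothetical helper names and assuming `p` and `q` are valid probability vectors of equal length:

```julia
# Hypothetical helpers; terms with pᵢ == 0 are skipped so that they contribute 0.
entropy_sketch(p) = -sum(pᵢ * log(pᵢ) for pᵢ in p if pᵢ > 0)
entropy_sketch(p, b) = entropy_sketch(p) / log(b)   # rescale by 1/log(b)

crossentropy_sketch(p, q) =
    -sum(pᵢ * log(qᵢ) for (pᵢ, qᵢ) in zip(p, q) if pᵢ > 0)

kldivergence_sketch(p, q) =
    sum(pᵢ * log(pᵢ / qᵢ) for (pᵢ, qᵢ) in zip(p, q) if pᵢ > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_bits = entropy_sketch(p, 2)   # entropy in bits (base 2)
identity_holds =
    crossentropy_sketch(p, q) ≈ entropy_sketch(p) + kldivergence_sketch(p, q)
```

Passing `b = 2` gives the result in bits; omitting it gives nats, since `log` is the natural logarithm.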

## Quantile and Related Functions

`StatsBase.percentile`

— Function`percentile(x, p)`

Return the `p`th percentile of a collection `x`, i.e. `quantile(x, p / 100)`.

`StatsBase.iqr`

— Function`iqr(x)`

Compute the interquartile range (IQR) of collection `x`, i.e. the 75th percentile minus the 25th percentile.

`StatsBase.nquantile`

— Function`nquantile(x, n::Integer)`

Return the n-quantiles of collection `x`, i.e. the values which partition `x` into `n` subsets of nearly equal size.

Equivalent to `quantile(x, (0:n)/n)`. For example, `nquantile(x, 5)` returns a vector of quantiles, respectively at `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]`.
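The three functions above are thin wrappers around `quantile`. The equivalences can be sketched with the `Statistics` standard library alone; the variable names are illustrative.

```julia
using Statistics

x = collect(1.0:9.0)

# percentile(x, p) corresponds to quantile(x, p / 100)
p90 = quantile(x, 90 / 100)

# iqr(x): 75th percentile minus 25th percentile
iqr_value = quantile(x, 0.75) - quantile(x, 0.25)

# nquantile(x, n): n + 1 cut points, including the minimum and maximum
n = 4
cuts = quantile(x, (0:n) / n)
```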

`Statistics.quantile`

— Function`quantile(v, w::AbstractWeights, p)`

Compute the weighted quantiles of a vector `v` at a specified set of probability values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`). Weights must not be negative. The weights and data vectors must have the same length. `NaN` is returned if `v` contains any `NaN` values. An error is raised if `w` contains any `NaN` values.

With `FrequencyWeights`, the function returns the same result as `quantile` for a vector with repeated values. Weights must be integers.

With weights other than `FrequencyWeights`, let $N$ be the length of the vector, $w$ the vector of weights, $h = p (\sum_{i \le N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$, and $S_k = \sum_{i \le k} w_i$ the cumulative weight for each observation. Define $v_{k+1}$ as the smallest element of `v` such that $S_{k+1}$ is strictly greater than $h$. The weighted $p$ quantile is given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when all weights are equal, the function returns the same result as the unweighted `quantile`.
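The interpolation rule for weights other than `FrequencyWeights` can be followed step by step in code. This is a sketch of the formula as stated, not the package's implementation: the helper name is hypothetical, a single probability `p` is handled, and strictly positive weights without `NaN` values are assumed.

```julia
using Statistics

# Hypothetical helper implementing the formula above for one probability p.
function weighted_quantile_sketch(v::AbstractVector{<:Real}, w::AbstractVector{<:Real}, p::Real)
    length(v) == length(w) || throw(DimensionMismatch("v and w must have the same length"))
    all(>(0), w) || throw(ArgumentError("this sketch assumes strictly positive weights"))
    0 <= p <= 1 || throw(ArgumentError("p must lie in [0, 1]"))

    order = sortperm(v)               # sort observations, carrying weights along
    vs, ws = v[order], w[order]
    S = cumsum(ws)                    # S[k] is the cumulative weight S_k
    h = p * (S[end] - ws[1]) + ws[1]  # target cumulative weight

    k1 = findfirst(>(h), S)           # index k + 1: first S strictly greater than h
    k1 === nothing && return float(vs[end])   # p == 1: h equals the total weight
    k = k1 - 1                        # k >= 1 because h >= S[1]
    γ = (h - S[k]) / (S[k1] - S[k])
    return vs[k] + γ * (vs[k1] - vs[k])
end

q_weighted = weighted_quantile_sketch([1.0, 2.0, 3.0], [1.0, 1.0, 2.0], 0.5)

# With equal weights the result agrees with the unweighted quantile.
q_equal = weighted_quantile_sketch(collect(1.0:5.0), ones(5), 0.3)
q_plain = quantile(1.0:5.0, 0.3)
```

In the first call, $h = 0.5 \cdot (4 - 1) + 1 = 2.5$ falls between $S_2 = 2$ and $S_3 = 4$, so $\gamma = 0.25$ and the result lies a quarter of the way from 2 to 3.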

`Statistics.median`

— Method`median(v::RealVector, w::AbstractWeights)`

Compute the weighted median of `v` with weights `w` (of type `AbstractWeights`). See the documentation for `quantile` for more details.

## Mode and Modes

`StatsBase.mode`

— Function`mode(a, [r])`

Return the mode (most common number) of an array, optionally over a specified range `r`. If several modes exist, the first one (in order of appearance) is returned.

`StatsBase.modes`

— Function`modes(a, [r])::Vector`

Return all modes (most common numbers) of an array, optionally over a specified range `r`.
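The difference between `mode` and `modes` shows up when several values tie for the highest count. A minimal counting sketch follows; the helper is hypothetical, and returning ties in order of first appearance is an assumption made for this example.

```julia
# Hypothetical helper: all most frequent values, in order of first appearance.
function modes_sketch(a)
    isempty(a) && throw(ArgumentError("mode is not defined for empty collections"))
    counts = Dict{eltype(a), Int}()
    order = eltype(a)[]               # distinct values in order of first appearance
    for x in a
        if haskey(counts, x)
            counts[x] += 1
        else
            counts[x] = 1
            push!(order, x)
        end
    end
    top = maximum(values(counts))     # highest frequency
    return [x for x in order if counts[x] == top]
end

a = [3, 1, 3, 2, 1]
all_modes = modes_sketch(a)       # 3 and 1 both occur twice
single_mode = first(all_modes)    # the first mode in order of appearance
```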

## Summary Statistics

`StatsBase.summarystats`

— Function`summarystats(a)`

Compute summary statistics for a real-valued array `a`. Returns a `SummaryStats` object containing the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

`DataAPI.describe`

— Function`describe(a)`

Pretty-print the summary statistics provided by `summarystats`: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.
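The six quantities reported by `summarystats` and `describe` can be gathered with the `Statistics` standard library. The sketch below returns a plain `NamedTuple` instead of a `SummaryStats` object; the helper name and field names are hypothetical.

```julia
using Statistics

# Hypothetical helper returning a NamedTuple rather than a `SummaryStats` object.
function summary_sketch(a::AbstractArray{<:Real})
    q25, med, q75 = quantile(vec(a), [0.25, 0.5, 0.75])
    return (mean = mean(a), min = minimum(a), q25 = q25,
            median = med, q75 = q75, max = maximum(a))
end

s = summary_sketch(collect(1.0:9.0))
```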