Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

Moments

Base.var — Method.

var(x, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the variance of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
FrequencyWeights: $\frac{1}{\sum{w} - 1}$
ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
Weights: ArgumentError (bias correction not supported)
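The formulas above can be checked by hand. The following is a minimal sketch, assuming a StatsBase version in which var accepts the corrected keyword as documented here; the sample data and variable names are illustrative only.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 3.0, 4.0]
w = fweights([1, 2, 2, 1])            # FrequencyWeights: x[2] and x[3] each count twice

μ  = mean(x, w)                        # weighted mean
ss = sum(w .* (x .- μ) .^ 2)           # weighted sum of squared deviations

v_uncorrected = ss / sum(w)            # factor 1/∑w
v_corrected   = ss / (sum(w) - 1)      # FrequencyWeights factor 1/(∑w - 1)

@assert v_uncorrected ≈ var(x, w; corrected=false)
@assert v_corrected   ≈ var(x, w; corrected=true)

# With frequency weights, the corrected result should match the ordinary
# sample variance of the expanded data.
@assert v_corrected ≈ var([1.0, 2.0, 2.0, 3.0, 3.0, 4.0])
```

Passing corrected explicitly avoids relying on the default, which may differ between releases.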

Base.std — Method.

std(x, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the standard deviation of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
FrequencyWeights: $\frac{1}{\sum{w} - 1}$
ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
Weights: ArgumentError (bias correction not supported)
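As an illustration of the AnalyticWeights correction factor, here is a short sketch that rebuilds the weighted standard deviation from the formula and compares it with std. The data are made up, and the corrected keyword is passed explicitly on the assumption that the installed version accepts it as documented.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0]
w = aweights([0.5, 1.0, 1.5])                   # AnalyticWeights

μ  = mean(x, w)
ss = sum(w .* (x .- μ) .^ 2)                    # weighted sum of squared deviations

# Correction factor for AnalyticWeights: 1 / (∑w - ∑w² / ∑w)
factor = 1 / (sum(w) - sum(w .^ 2) / sum(w))

@assert std(x, w; corrected=false) ≈ sqrt(ss / sum(w))
@assert std(x, w; corrected=true)  ≈ sqrt(factor * ss)
```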

StatsBase.mean_and_var — Function.

mean_and_var(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, var)

Return the mean and variance of a real-valued array x, optionally over a dimension dim, as a tuple. Observations in x can be weighted using weight vector w. Bias correction is applied to the variance calculation if corrected=true. See var documentation for more details.

StatsBase.mean_and_std — Function.

mean_and_std(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, std)

Return the mean and standard deviation of a real-valued array x, optionally over a dimension dim, as a tuple. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if corrected=true. See std documentation for more details.
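A brief usage sketch for the two combined functions follows. The data are illustrative, and corrected is always given explicitly so the example does not depend on the default.

```julia
using Statistics, StatsBase

x = [2.0, 4.0, 4.0, 6.0]

# Unweighted: the tuple should agree with separate calls to mean and var/std.
m, v = mean_and_var(x; corrected=true)
@assert m ≈ mean(x)
@assert v ≈ var(x)

# Weighted: pass the weight vector as the second argument.
w = aweights([1.0, 2.0, 2.0, 1.0])
mw, sw = mean_and_std(x, w; corrected=false)
@assert mw ≈ mean(x, w)
@assert sw ≈ std(x, w; corrected=false)
```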

StatsBase.skewness — Function.

skewness(v, [wv::AbstractWeights], m=mean(v))

Compute the standardized skewness of a real-valued array v, optionally specifying a weighting vector wv and a center m.

StatsBase.kurtosis — Function.

kurtosis(v, [wv::AbstractWeights], m=mean(v))

Compute the excess kurtosis of a real-valued array v, optionally specifying a weighting vector wv and a center m.

StatsBase.moment — Function.

moment(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order central moment of a real-valued array v, optionally specifying a weighting vector wv and a center m.
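The three functions are related through the central moments. The sketch below assumes the usual definitions (skewness as the third central moment divided by the 3/2 power of the second, and excess kurtosis as the fourth divided by the square of the second, minus 3); the data are illustrative.

```julia
using Statistics, StatsBase

v = [1.0, 2.0, 3.0, 10.0]

m2 = moment(v, 2)      # second central moment (uncorrected variance)
m3 = moment(v, 3)
m4 = moment(v, 4)

@assert m2 ≈ var(v; corrected=false)
@assert skewness(v) ≈ m3 / m2^1.5
@assert kurtosis(v) ≈ m4 / m2^2 - 3     # "excess": a normal distribution gives 0
```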

Measurements of Variation

StatsBase.span — Function.

span(x)

Return the span of an integer array, i.e. the range minimum(x):maximum(x). The minimum and maximum of x are computed in one pass using extrema.

StatsBase.variation — Function.

variation(x, m=mean(x))

Return the coefficient of variation of an array x, optionally specifying a precomputed mean m. The coefficient of variation is the ratio of the standard deviation to the mean.

StatsBase.sem — Function.

sem(a)

Return the standard error of the mean of a, i.e. sqrt(var(a) / length(a)).
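The definitions of span, variation, and sem given above can be verified directly. This is a small sketch with made-up data; it assumes that variation and sem use the default (bias-corrected) std and var from Statistics.

```julia
using Statistics, StatsBase

a = [2.0, 4.0, 4.0, 6.0, 9.0]

# Coefficient of variation: standard deviation divided by the mean.
@assert variation(a) ≈ std(a) / mean(a)

# Standard error of the mean.
@assert sem(a) ≈ sqrt(var(a) / length(a))

# span works on integer arrays and returns a range.
@assert span([3, 1, 7]) == 1:7
```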

StatsBase.mad — Function.

mad(v; center=median(v), normalize=true)

Compute the median absolute deviation (MAD) of v around center (by default, around the median).

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.
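To make the effect of normalize concrete, the following sketch computes the MAD both ways on a small sample with one outlier. The data are illustrative; normalize is passed explicitly because its default may vary across versions.

```julia
using Statistics, StatsBase

v = [1.0, 2.0, 3.0, 4.0, 100.0]

# Raw MAD: median of absolute deviations from the median.
raw = mad(v; normalize=false)
@assert raw ≈ median(abs.(v .- median(v)))

# Normalized MAD: raw value scaled by roughly 1.4826.
scaled = mad(v; normalize=true)
@assert isapprox(scaled / raw, 1.4826; atol=1e-4)
```

Unlike std(v), which is dominated by the value 100.0, the MAD here depends only on the central observations.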

Z-scores

StatsBase.zscore — Function.

zscore(X, [μ, σ])

Compute the z-scores of X, optionally specifying a precomputed mean μ and standard deviation σ. A z-score is the signed number of standard deviations by which an observation lies above the mean, i.e. $(x - μ) / σ$.

μ and σ should be both scalars or both arrays. The computation is broadcasting. In particular, when μ and σ are arrays, they should have the same size, and size(μ, i) == 1 || size(μ, i) == size(X, i) for each dimension.
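The scalar and array forms can be sketched as follows. The data are illustrative; in the array case μ and σ are 1×2 row vectors, so each column of X is standardized separately.

```julia
using Statistics, StatsBase

# Scalar μ and σ: every element is shifted and scaled the same way.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
@assert zscore(x, 3.0, 2.0) ≈ [-1.0, -0.5, 0.0, 0.5, 1.0]

# Array μ and σ: size(μ, 1) == 1 and size(μ, 2) == size(X, 2),
# so the computation broadcasts down each column.
X = [1.0 10.0; 2.0 20.0; 3.0 30.0]
μ = mean(X, dims=1)
σ = std(X, dims=1)
Z = zscore(X, μ, σ)

@assert Z ≈ (X .- μ) ./ σ
@assert Z ≈ [-1.0 -1.0; 0.0 0.0; 1.0 1.0]
```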

StatsBase.zscore! — Function.

zscore!([Z], X, μ, σ)

Compute the z-scores of an array X with mean μ and standard deviation σ. A z-score is the signed number of standard deviations by which an observation lies above the mean, i.e. $(x - μ) / σ$.

If a destination array Z is provided, the scores are stored in Z and it must have the same shape as X. Otherwise X is overwritten.

Entropy and Related Functions

StatsBase.entropy — Function.

entropy(p, [b])

Compute the entropy of an array p, optionally specifying a real number b such that the entropy is scaled by 1/log(b).

StatsBase.renyientropy — Function.

renyientropy(p, α)

Compute the Rényi (generalized) entropy of order α of an array p.

StatsBase.crossentropy — Function.

crossentropy(p, q, [b])

Compute the cross entropy between p and q, optionally specifying a real number b such that the result is scaled by 1/log(b).

StatsBase.kldivergence — Function.

kldivergence(p, q, [b])

Compute the Kullback-Leibler divergence of q from p, optionally specifying a real number b such that the divergence is scaled by 1/log(b).
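The entropy functions above are linked by a standard identity: the Kullback-Leibler divergence equals the cross entropy minus the entropy of p. The sketch below checks this on two illustrative probability vectors, assuming the natural logarithm is used when b is omitted.

```julia
using StatsBase

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

# Shannon entropy in nats, and in bits when the base b = 2 is given.
@assert entropy(p) ≈ -sum(p .* log.(p))
@assert entropy(p, 2) ≈ 1.5

# Cross entropy and KL divergence written out from their definitions.
@assert crossentropy(p, q) ≈ -sum(p .* log.(q))
@assert kldivergence(p, q) ≈ sum(p .* log.(p ./ q))

# Identity connecting the three quantities.
@assert kldivergence(p, q) ≈ crossentropy(p, q) - entropy(p)
```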

Quantile and Related Functions

StatsBase.percentile — Function.

percentile(v, p)

Return the pth percentile of a real-valued array v, i.e. quantile(v, p / 100).

StatsBase.iqr — Function.

iqr(v)

Compute the interquartile range (IQR) of an array, i.e. the 75th percentile minus the 25th percentile.

StatsBase.nquantile — Function.

nquantile(v, n)

Return the n-quantiles of a real-valued array, i.e. the values which partition v into n subsets of nearly equal size.

Equivalent to quantile(v, (0:n)/n). For example, nquantile(x, 5) returns a vector of quantiles, respectively at [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].
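A short sketch relating percentile, iqr, and nquantile to quantile from the Statistics standard library, using illustrative data:

```julia
using Statistics, StatsBase

v = collect(1.0:9.0)        # 1.0, 2.0, ..., 9.0

# n-quantiles are ordinary quantiles at the probabilities 0, 1/n, ..., 1.
@assert nquantile(v, 4) ≈ quantile(v, (0:4) ./ 4)
@assert nquantile(v, 4) ≈ [1.0, 3.0, 5.0, 7.0, 9.0]

# Percentiles are quantiles with p expressed out of 100.
@assert percentile(v, 25) ≈ quantile(v, 0.25)

# The interquartile range is the distance between the outer quartiles.
@assert iqr(v) ≈ quantile(v, 0.75) - quantile(v, 0.25)
```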

Base.quantile — Function.

quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector v at a specified set of probability values p, using weights given by a weight vector w (of type AbstractWeights). Weights must not be negative. The weights and data vectors must have the same length.

With FrequencyWeights, the function returns the same result as quantile for a vector with repeated values. For other weight types, let $N$ be the length of the vector, $w$ the vector of weights, $h = p (\sum_{i \le N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$, and $S_k = \sum_{i \le k} w_i$ the cumulative weight up to observation $k$. Define $v_{k+1}$ as the smallest element of v such that $S_{k+1}$ is strictly greater than $h$. The weighted $p$ quantile is then $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when w is a vector of ones, the function returns the same result as quantile.
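The two special cases mentioned above (unit weights and FrequencyWeights) can be checked with a small sketch. The data are illustrative, and the exact interpolation may differ between StatsBase releases, so treat the assertions as expectations based on the description here rather than guarantees.

```julia
using Statistics, StatsBase

# Unit weights: the weighted quantile should reduce to the ordinary quantile.
v = [1.0, 2.0, 3.0, 4.0, 5.0]
p = [0.1, 0.3, 0.5, 0.9]
unit = weights(ones(length(v)))
@assert quantile(v, unit, p) ≈ quantile(v, p)

# Frequency weights: equivalent to repeating each value w[i] times.
f = fweights([1, 2, 1])
@assert quantile([1.0, 2.0, 3.0], f, 0.5) ≈ quantile([1.0, 2.0, 2.0, 3.0], 0.5)
```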

Base.median — Method.

median(x::RealVector, w::AbstractWeights)

Compute the weighted median of x, using weights given by a weight vector w (of type AbstractWeights). The weight and data vectors must have the same length.

The weighted median $x_k$ is the element of x that satisfies $\sum_{x_i < x_k} w_i \le \frac{1}{2} \sum_{j} w_j$ and $\sum_{x_i > x_k} w_i \le \frac{1}{2} \sum_{j} w_j$.

If a weight has value zero, then its associated data point is ignored. If none of the weights are positive, an error is thrown. NaN is returned if x contains any NaN values. An error is raised if w contains any NaN values.
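The defining condition can be turned into a direct, quadratic-time check. The function below is a hand-written sketch of that condition only; weighted_median_sketch is a hypothetical name, it is not the StatsBase implementation, and it does not handle NaN values.

```julia
# Return the first element x[k] with positive weight such that the total weight
# strictly below it and strictly above it are each at most half the total.
function weighted_median_sketch(x::AbstractVector{<:Real}, w::AbstractVector{<:Real})
    length(x) == length(w) || throw(ArgumentError("x and w must have the same length"))
    any(>(0), w) || throw(ArgumentError("at least one weight must be positive"))
    half = sum(w) / 2
    for k in eachindex(x)
        w[k] == 0 && continue            # zero-weight points are ignored
        below = sum(w[x .< x[k]])        # total weight of elements smaller than x[k]
        above = sum(w[x .> x[k]])        # total weight of elements larger than x[k]
        if below <= half && above <= half
            return x[k]
        end
    end
    error("no element satisfies the weighted-median condition")
end

# A heavy weight on the last element pulls the median toward it.
@assert weighted_median_sketch([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 5.0]) == 4.0
# Equal weights give the ordinary median for an odd-length vector.
@assert weighted_median_sketch([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]) == 2.0
```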

Mode and Modes

StatsBase.mode — Function.

mode(a, [r])

Return the mode (most common number) of an array, optionally over a specified range r. If several modes exist, the first one (in order of appearance) is returned.

StatsBase.modes — Function.

modes(a, [r])::Vector

Return all modes (most common numbers) of an array, optionally over a specified range r.
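A brief sketch of the difference between mode and modes on data with a tie. The sample is illustrative; the result of modes is sorted before comparison because the order of the returned vector is not assumed here.

```julia
using StatsBase

a = [1, 2, 2, 3, 3]

# 2 and 3 both occur twice; mode returns the one that reaches that count first.
@assert mode(a) == 2

# modes returns every value that attains the maximum count.
@assert sort(modes(a)) == [2, 3]
```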

Summary Statistics

StatsBase.summarystats — Function.

summarystats(a)

Compute summary statistics for a real-valued array a. Returns a SummaryStats object containing the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

StatsBase.describe — Function.

describe(a)

Pretty-print the summary statistics provided by summarystats: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.
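A minimal usage sketch follows. It assumes the SummaryStats object exposes the fields mean, min, q25, median, q75, and max; newer releases may add further fields. The data are illustrative.

```julia
using Statistics, StatsBase

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

s = summarystats(a)

# Each field should agree with the corresponding standalone function.
@assert s.mean   ≈ mean(a)
@assert s.min    == minimum(a)
@assert s.q25    ≈ quantile(a, 0.25)
@assert s.median ≈ median(a)
@assert s.q75    ≈ quantile(a, 0.75)
@assert s.max    == maximum(a)

# describe prints the same statistics in a readable layout.
describe(a)
```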