Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

Moments

Base.var — Method.

var(x, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the variance of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
FrequencyWeights: $\frac{1}{\sum{w} - 1}$
ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
Weights: ArgumentError (bias correction not supported)
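
The formulas above can be checked numerically. The following is a minimal sketch, assuming a recent Julia where mean and var live in the Statistics standard library and StatsBase is installed; the data and the name freq are made up for illustration.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0]
freq = [2, 1, 3]            # hypothetical number of times each value was observed
fw = fweights(freq)

# Uncorrected weighted variance, written out from the formula above.
μ = mean(x, fw)
v_manual = sum(freq .* (x .- μ) .^ 2) / sum(freq)
v_uncorrected = var(x, fw; corrected=false)

# With frequency weights, the corrected variance should match the ordinary
# sample variance of the data with each value repeated freq[i] times.
expanded = [xi for (xi, c) in zip(x, freq) for _ in 1:c]
v_corrected = var(x, fw; corrected=true)
```

Passing corrected explicitly avoids relying on the default, which may differ between package versions.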

Base.std — Method.

std(x, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the standard deviation of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
FrequencyWeights: $\frac{1}{\sum{w} - 1}$
ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
Weights: ArgumentError (bias correction not supported)
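
As an illustration of the AnalyticWeights correction factor listed above, here is a small sketch that computes both versions of the standard deviation by hand. It assumes StatsBase is installed alongside the Statistics standard library; the weights are hypothetical.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0, 7.0]
w = [0.5, 1.0, 1.5, 1.0]    # hypothetical analytic (precision) weights
aw = aweights(w)

μ = mean(x, aw)
ss = sum(w .* (x .- μ) .^ 2)    # weighted sum of squared deviations

# Uncorrected: divide by the sum of the weights.
s_uncorrected = sqrt(ss / sum(w))

# Corrected: replace 1/Σw by the AnalyticWeights factor.
s_corrected = sqrt(ss / (sum(w) - sum(w .^ 2) / sum(w)))
```

Both values should agree with std(x, aw; corrected=false) and std(x, aw; corrected=true) respectively.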

StatsBase.mean_and_var — Function.

mean_and_var(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, var)

Return the mean and variance of a real-valued array x, optionally over a dimension dim, as a tuple. Observations in x can be weighted using weight vector w. Finally, bias correction is applied to the variance calculation if corrected=true. See the var documentation for more details.

StatsBase.mean_and_std — Function.

mean_and_std(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, std)

Return the mean and standard deviation of a real-valued array x, optionally over a dimension dim, as a tuple. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if corrected=true. See std documentation for more details.
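
A short usage sketch for the two combined functions, assuming StatsBase and the Statistics standard library are available; the sample values and weights are invented.

```julia
using Statistics, StatsBase

x = [2.0, 4.0, 4.0, 5.0, 9.0]

# Unweighted: each call returns a (mean, spread) tuple.
m, v = mean_and_var(x; corrected=true)
m2, s = mean_and_std(x; corrected=true)

# Weighted, using hypothetical frequency weights.
w = fweights([1, 2, 2, 1, 1])
mw, vw = mean_and_var(x, w; corrected=true)
```

The results should match calling mean, var and std separately with the same arguments.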

StatsBase.skewness — Function.

skewness(v, [wv::AbstractWeights], m=mean(v))

Compute the standardized skewness of a real-valued array v, optionally specifying a weighting vector wv and a center m.

StatsBase.kurtosis — Function.

kurtosis(v, [wv::AbstractWeights], m=mean(v))

Compute the excess kurtosis of a real-valued array v, optionally specifying a weighting vector wv and a center m.

StatsBase.moment — Function.

moment(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order central moment of a real-valued array v, optionally specifying a weighting vector wv and a center m.
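
The three functions above are related through the central moments. The sketch below, which assumes StatsBase is installed and uses made-up data, rebuilds skewness and excess kurtosis from moment, using moments that divide by n rather than n - 1.

```julia
using Statistics, StatsBase

v = [1.0, 2.0, 2.0, 3.0, 9.0]

m2 = moment(v, 2)    # second central moment (divides by n)
m3 = moment(v, 3)
m4 = moment(v, 4)

skew_manual = m3 / m2^1.5         # standardized third moment
exkurt_manual = m4 / m2^2 - 3     # standardized fourth moment minus 3
```

These should agree with skewness(v) and kurtosis(v), and m2 with var(v; corrected=false).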

Measurements of Variation

StatsBase.span — Function.

span(x)

Return the span of an integer array, i.e. the range minimum(x):maximum(x). The minimum and maximum of x are computed in one pass using extrema.

StatsBase.variation — Function.

variation(x, m=mean(x))

Return the coefficient of variation of an array x, optionally specifying a precomputed mean m. The coefficient of variation is the ratio of the standard deviation to the mean.

StatsBase.sem — Function.

sem(a)

Return the standard error of the mean of a, i.e. sqrt(var(a) / length(a)).
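
A brief sketch of span, variation and sem on a small integer array, assuming StatsBase and the Statistics standard library are loaded; the data is illustrative only.

```julia
using Statistics, StatsBase

x = [3, 7, 5, 4, 6]

r = span(x)          # range from the minimum to the maximum of x
cv = variation(x)    # coefficient of variation: std(x) / mean(x)
se = sem(x)          # standard error of the mean: sqrt(var(x) / length(x))
```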

StatsBase.mad — Function.

mad(v)

Compute the median absolute deviation of v.

Z-scores

StatsBase.zscore — Function.

zscore(X, [μ, σ])

Compute the z-scores of X, optionally specifying a precomputed mean μ and standard deviation σ. A z-score is the signed number of standard deviations by which an observation lies above the mean, i.e. $(x - μ) / σ$.

μ and σ should be both scalars or both arrays. The computation is broadcasting. In particular, when μ and σ are arrays, they should have the same size, and size(μ, i) == 1 || size(μ, i) == size(X, i) for each dimension.

StatsBase.zscore! — Function.

zscore!([Z], X, μ, σ)

Compute the z-scores of an array X with mean μ and standard deviation σ. A z-score is the signed number of standard deviations by which an observation lies above the mean, i.e. $(x - μ) / σ$.

If a destination array Z is provided, the scores are stored in Z and it must have the same shape as X. Otherwise X is overwritten.
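
The broadcasting rule for μ and σ is easiest to see on a matrix. The following sketch assumes StatsBase and the Statistics standard library; the matrix is invented, and the column-wise μ and σ are computed here with mean and std over dims=1.

```julia
using Statistics, StatsBase

X = [1.0 10.0; 2.0 20.0; 3.0 60.0]

# Whole-array z-scores: a single mean and standard deviation.
Z_all = zscore(X)

# Column-wise z-scores: μ and σ are 1×2 arrays that broadcast against X.
μ = mean(X, dims=1)
σ = std(X, dims=1)
Z_cols = zscore(X, μ, σ)

# In-place variant writing into a preallocated destination of the same shape.
Z = similar(X)
zscore!(Z, X, μ, σ)
```

After column-wise scaling, each column of Z_cols should have a mean close to zero.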

Entropy and Related Functions

StatsBase.entropy — Function.

entropy(p, [b])

Compute the entropy of an array p, optionally specifying a real number b such that the entropy is scaled by 1/log(b).

StatsBase.renyientropy — Function.

renyientropy(p, α)

Compute the Rényi (generalized) entropy of order α of an array p.

StatsBase.crossentropy — Function.

crossentropy(p, q, [b])

Compute the cross entropy between p and q, optionally specifying a real number b such that the result is scaled by 1/log(b).

StatsBase.kldivergence — Function.

kldivergence(p, q, [b])

Compute the Kullback-Leibler divergence of q from p, optionally specifying a real number b such that the divergence is scaled by 1/log(b).
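
The sketch below exercises the four functions in this section on two small probability vectors. It assumes StatsBase is installed; p and q are made-up distributions, and the formulas in the comments are the standard definitions rather than text taken from this page.

```julia
using StatsBase

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_nats = entropy(p)        # natural logarithm: -Σ p_i log(p_i)
h_bits = entropy(p, 2)     # scaled by 1/log(2), i.e. measured in bits

ce = crossentropy(p, q)    # -Σ p_i log(q_i)
kl = kldivergence(p, q)    #  Σ p_i log(p_i / q_i)
r2 = renyientropy(p, 2)    # log(Σ p_i^2) / (1 - 2)
```

With these definitions the cross entropy decomposes as entropy(p) + kldivergence(p, q), which gives a quick consistency check.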

Quantile and Related Functions

StatsBase.percentile — Function.

percentile(v, p)

Return the pth percentile of a real-valued array v, i.e. quantile(v, p / 100).

StatsBase.iqr — Function.

iqr(v)

Compute the interquartile range (IQR) of an array, i.e. the 75th percentile minus the 25th percentile.

StatsBase.nquantile — Function.

nquantile(v, n)

Return the n-quantiles of a real-valued array, i.e. the values which partition v into n subsets of nearly equal size.

Equivalent to quantile(v, (0:n)/n). For example, nquantile(x, 5) returns a vector of six quantiles, at probabilities [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] respectively.
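
A small sketch relating percentile, iqr and nquantile back to quantile, assuming StatsBase and the Statistics standard library are loaded; the input vector is illustrative.

```julia
using Statistics, StatsBase

v = collect(1.0:11.0)

p90 = percentile(v, 90)     # same as the 0.9 quantile
spread = iqr(v)             # 75th percentile minus 25th percentile
fifths = nquantile(v, 5)    # six cut points: 0%, 20%, ..., 100%
```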

Base.quantile — Function.

quantile(x, w::AbstractWeights, p)

Compute the weighted quantiles of a vector x at a specified set of probability values p, using weights given by a weight vector w (of type AbstractWeights). Weights must not be negative. The weights and data vectors must have the same length.

The quantile for p is defined as follows. Denoting $S_k = (k-1)w_k + (n-1) \sum_{i<k}w_i$, define $x_{k+1}$ as the smallest element of x such that $S_{k+1}/S_{n}$ is strictly greater than p. The function returns $(1-\gamma) x_k + \gamma x_{k+1}$ with $\gamma = (pS_n- S_k)/(S_{k+1}-S_k)$.

This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when w contains only ones (see Wikipedia).
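
The unit-weight case can be checked directly. This is a minimal sketch, assuming StatsBase and the Statistics standard library; the data, the probabilities and the names unit_w and heavy are chosen for illustration.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 3.0, 4.0, 5.0]
p = [0.1, 0.25, 0.5, 0.9]

# All weights equal to one: should agree with the unweighted quantile.
unit_w = weights(ones(length(x)))
q_unit = quantile(x, unit_w, p)

# Up-weighting the largest observation should pull the quantiles upward.
heavy = weights([1.0, 1.0, 1.0, 1.0, 10.0])
q_heavy = quantile(x, heavy, p)
```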

Base.median — Method.

median(x::RealVector, w::AbstractWeights)

Compute the weighted median of x, using weights given by a weight vector w (of type AbstractWeights). The weight and data vectors must have the same length.

The weighted median $x_k$ is the element of x that satisfies $\sum_{x_i < x_k} w_i \le \frac{1}{2} \sum_{j} w_j$ and $\sum_{x_i > x_k} w_i \le \frac{1}{2} \sum_{j} w_j$.

If a weight has value zero, then its associated data point is ignored. If none of the weights are positive, an error is thrown. NaN is returned if x contains any NaN values. An error is raised if w contains any NaN values.
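
The handling of zero weights can be sketched as follows, assuming StatsBase and the Statistics standard library; the data and weights are hypothetical.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 3.0, 100.0]

# The outlier carries zero weight, so it should not influence the result.
w = weights([1, 1, 1, 0])
m = median(x, w)
```

Here m should equal the ordinary median of the first three values.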

Mode and Modes

StatsBase.mode — Function.

mode(a, [r])

Return the mode (most common number) of an array, optionally over a specified range r. If several modes exist, the first one (in order of appearance) is returned.

StatsBase.modes — Function.

modes(a, [r])::Vector

Return all modes (most common numbers) of an array, optionally over a specified range r.
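
A short sketch of mode and modes on an array with a tie, assuming StatsBase is installed; the array is made up. The result of modes is sorted here so the example does not depend on the order in which the modes are returned.

```julia
using StatsBase

a = [1, 2, 2, 3, 3]

first_mode = mode(a)          # 2 and 3 tie; 2 reaches the top count first
all_modes = sort(modes(a))    # every value that attains the top count

# Restricting the search to the integer range 1:3.
ranged_mode = mode(a, 1:3)
ranged_modes = modes(a, 1:3)
```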

Summary Statistics

StatsBase.summarystats — Function.

summarystats(a)

Compute summary statistics for a real-valued array a. Returns a SummaryStats object containing the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

StatsBase.describe — Function.

describe(a)

Pretty-print the summary statistics provided by summarystats: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.
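
A usage sketch for both functions, assuming StatsBase and the Statistics standard library. The field names used below (mean, min, q25, median, q75, max) are an assumption about the SummaryStats type and should be checked against the installed version; the data is invented.

```julia
using Statistics, StatsBase

a = [2.0, 4.0, 4.0, 5.0, 9.0, 12.0]

s = summarystats(a)

# Assumed field names of the SummaryStats object.
five_numbers = (s.min, s.q25, s.median, s.q75, s.max)

describe(a)    # prints the same statistics to standard output
```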