Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

Moments

Base.var (Method)
var(x, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the variance of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$

  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$

  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)

  • Weights: ArgumentError (bias correction not supported)
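
The normalization factors listed above can be checked by hand on a small input. The following is a minimal sketch that uses only Base and the Statistics standard library; `weighted_var` and its `factor` keyword are hypothetical names introduced for illustration and are not part of the package.

```julia
using Statistics

# Hypothetical helper (not part of StatsBase): weighted variance with an
# explicit normalization factor, defaulting to the uncorrected 1/∑w.
function weighted_var(x, w; factor = 1 / sum(w))
    μ = sum(w .* x) / sum(w)                  # weighted mean
    return factor * sum(w .* (x .- μ) .^ 2)
end

x = [1.0, 2.0, 4.0]
w = [1.0, 1.0, 2.0]
n = count(!iszero, w)                         # number of nonzero weights

uncorrected = weighted_var(x, w)
analytic    = weighted_var(x, w; factor = 1 / (sum(w) - sum(w .^ 2) / sum(w)))
frequency   = weighted_var(x, w; factor = 1 / (sum(w) - 1))
probability = weighted_var(x, w; factor = n / ((n - 1) * sum(w)))

# With frequency weights, the corrected value matches the ordinary sample
# variance of the data with each observation repeated w_i times.
expanded = var([1.0, 2.0, 4.0, 4.0])
```

The standard deviation variants follow by taking the square root of each value.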

Base.std (Method)
std(x, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the standard deviation of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The bias-corrected estimate (when corrected=true) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ inside the square root with a factor dependent on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$

  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$

  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)

  • Weights: ArgumentError (bias correction not supported)

mean_and_var(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, var)

Return the mean and variance of a real-valued array x, optionally over a dimension dim, as a tuple. Observations in x can be weighted using weight vector w. Bias correction is applied to the variance calculation if corrected=true. See the var documentation for more details.

mean_and_std(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, std)

Return the mean and standard deviation of a real-valued array x, optionally over a dimension dim, as a tuple. A weighting vector w can be specified to weight the estimates. Bias correction is applied to the standard deviation calculation if corrected=true. See the std documentation for more details.

StatsBase.skewness (Function)
skewness(v, [wv::AbstractWeights], m=mean(v))

Compute the standardized skewness of a real-valued array v, optionally specifying a weighting vector wv and a center m.

StatsBase.kurtosis (Function)
kurtosis(v, [wv::AbstractWeights], m=mean(v))

Compute the excess kurtosis of a real-valued array v, optionally specifying a weighting vector wv and a center m.

StatsBase.moment (Function)
moment(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order central moment of a real-valued array v, optionally specifying a weighting vector wv and a center m.
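
The relationship between the three unweighted quantities can be sketched with plain Julia. This assumes the usual population definitions (moments divided by $n$, no bias correction), which the descriptions above do not spell out; `central_moment` is a hypothetical helper, not the package's function.

```julia
using Statistics

# Hypothetical helper: k-th central moment of v about the center m.
central_moment(v, k, m = mean(v)) = mean((v .- m) .^ k)

v = [1.0, 2.0, 3.0, 4.0, 10.0]

m2     = central_moment(v, 2)                 # second central moment
skew   = central_moment(v, 3) / m2^1.5        # standardized skewness
exkurt = central_moment(v, 4) / m2^2 - 3      # excess kurtosis (normal gives 0)
```

A positive `skew` here reflects the single large value in the right tail.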


Measurements of Variation

StatsBase.span (Function)
span(x)

Return the span of an integer array, i.e. the range minimum(x):maximum(x). The minimum and maximum of x are computed in one pass using extrema.

StatsBase.variation (Function)
variation(x, m=mean(x))

Return the coefficient of variation of an array x, optionally specifying a precomputed mean m. The coefficient of variation is the ratio of the standard deviation to the mean.

StatsBase.sem (Function)
sem(a)

Return the standard error of the mean of a, i.e. sqrt(var(a) / length(a)).
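
The definitions of span, coefficient of variation, and standard error given above translate directly into Base and Statistics calls. The sketch below computes them by hand and assumes the default (bias-corrected) `std` and `var` from the Statistics standard library; the variable names are illustrative only.

```julia
using Statistics

x = [3, 7, 5, 4]
lo, hi = extrema(x)              # minimum and maximum in a single pass
r = lo:hi                        # the span as a range

a = [2.0, 4.0, 4.0, 6.0]
cv = std(a) / mean(a)            # coefficient of variation
se = sqrt(var(a) / length(a))    # standard error of the mean
```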

StatsBase.mad (Function)
mad(v; center=median(v), normalize=true)

Compute the median absolute deviation (MAD) of v around center (by default, around the median).

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.
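
To make the effect of the normalization constant concrete, the following sketch computes the MAD by hand on data with one outlier. It uses only the Statistics standard library and the rounded constant 1.4826 quoted above; it is an illustration of the definition, not the package's implementation.

```julia
using Statistics

v = [1.0, 2.0, 3.0, 4.0, 100.0]

center  = median(v)                       # default center
raw_mad = median(abs.(v .- center))       # unnormalized MAD
scaled  = 1.4826 * raw_mad                # consistent with σ under normality

classical = std(v)                        # strongly inflated by the outlier
```

The scaled MAD stays close to the spread of the bulk of the data, while the classical standard deviation is dominated by the value 100.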


Z-scores

StatsBase.zscore (Function)
zscore(X, [μ, σ])

Compute the z-scores of X, optionally specifying a precomputed mean μ and standard deviation σ. A z-score is the signed number of standard deviations by which an observation lies above the mean, i.e. $(x - μ) / σ$.

μ and σ should be either both scalars or both arrays. The computation broadcasts. In particular, when μ and σ are arrays, they should have the same size, and size(μ, i) == 1 || size(μ, i) == size(X, i) must hold for each dimension i.

StatsBase.zscore! (Function)
zscore!([Z], X, μ, σ)

Compute the z-scores of an array X with mean μ and standard deviation σ. A z-score is the signed number of standard deviations by which an observation lies above the mean, i.e. $(x - μ) / σ$.

If a destination array Z is provided, the scores are stored in Z and it must have the same shape as X. Otherwise X is overwritten.
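
The broadcasting rule and the in-place variant can be illustrated with plain array operations. This is a sketch of the arithmetic using Base and Statistics only, with column-wise means and standard deviations chosen as an example; `Zbuf` is an illustrative name for the destination array.

```julia
using Statistics

X = [1.0 2.0; 3.0 6.0; 5.0 10.0]      # 3×2 matrix, one variable per column
μ = mean(X, dims = 1)                 # 1×2 row of column means
σ = std(X, dims = 1)                  # 1×2 row of column standard deviations

# Broadcasting works because size(μ, 1) == 1 and size(μ, 2) == size(X, 2).
Z = (X .- μ) ./ σ

# In-place variant: write the scores into a preallocated array shaped like X.
Zbuf = similar(X)
Zbuf .= (X .- μ) ./ σ
```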


Entropy and Related Functions

StatsBase.entropy (Function)
entropy(p, [b])

Compute the entropy of an array p, optionally specifying a real number b such that the entropy is scaled by 1/log(b).

renyientropy(p, α)

Compute the Rényi (generalized) entropy of order α of an array p.

crossentropy(p, q, [b])

Compute the cross entropy between p and q, optionally specifying a real number b such that the result is scaled by 1/log(b).

kldivergence(p, q, [b])

Compute the Kullback-Leibler divergence of q from p, optionally specifying a real number b such that the divergence is scaled by 1/log(b).
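
The four quantities in this section are closely related, which a small hand computation makes visible. The sketch below assumes the standard textbook definitions (natural logarithm unless a base is given, Rényi entropy $\log(\sum_i p_i^α) / (1 - α)$ for $α \ne 1$) and probability vectors without zero entries; it does not call the package.

```julia
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

H  = -sum(p .* log.(p))               # entropy in nats
H2 = H / log(2)                       # same entropy scaled by 1/log(b), b = 2
Hα = log(sum(p .^ 2)) / (1 - 2)       # Rényi entropy of order α = 2
CE = -sum(p .* log.(q))               # cross entropy between p and q
KL = sum(p .* log.(p ./ q))           # Kullback-Leibler divergence of q from p
```

Under these definitions the cross entropy decomposes as `CE == H + KL` up to rounding.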


Quantile and Related Functions

StatsBase.percentile (Function)
percentile(v, p)

Return the pth percentile of a real-valued array v, i.e. quantile(v, p / 100).

StatsBase.iqr (Function)
iqr(v)

Compute the interquartile range (IQR) of an array, i.e. the 75th percentile minus the 25th percentile.

StatsBase.nquantile (Function)
nquantile(v, n)

Return the n-quantiles of a real-valued array, i.e. the values which partition v into n subsets of nearly equal size.

Equivalent to quantile(v, (0:n)/n). For example, nquantile(v, 5) returns a vector of quantiles at the probabilities [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].
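
The equivalences stated for percentile, iqr, and nquantile can be written out with `quantile` from the Statistics standard library. This is a sketch of those identities on a small vector, assuming the default interpolation of `quantile`; the variable names are illustrative.

```julia
using Statistics

v = collect(1.0:9.0)

p90       = quantile(v, 90 / 100)                     # 90th percentile
iqr_v     = quantile(v, 0.75) - quantile(v, 0.25)     # interquartile range
quintiles = quantile(v, [k / 5 for k in 0:5])         # the 5-quantiles of v
```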

Base.quantile (Function)
quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector v at a specified set of probability values p, using weights given by a weight vector w (of type AbstractWeights). Weights must not be negative. The weights and data vectors must have the same length.

With FrequencyWeights, the function returns the same result as quantile for a vector with repeated values. With any other weight type, let $N$ be the length of the vector, $w$ the vector of weights, $h = p (\sum_{i \le N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$, and $S_k = \sum_{i \le k} w_i$ the cumulative weight up to the $k$-th observation, with observations taken in sorted order. Define $v_{k+1}$ as the smallest element of v such that $S_{k+1}$ is strictly greater than $h$. The weighted $p$ quantile is then given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when w is a vector of ones, the function returns the same result as quantile.
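
The interpolation rule for weights other than FrequencyWeights can be traced step by step. The following is a minimal sketch of that rule for a single probability p, assuming strictly positive weights; `weighted_quantile_sketch` is a hypothetical helper and is not the package's implementation.

```julia
using Statistics

# Hypothetical helper implementing the rule described above for one p in [0, 1].
function weighted_quantile_sketch(v, w, p)
    order = sortperm(v)
    vs, ws = v[order], w[order]            # data and weights in sorted order
    S = cumsum(ws)                         # cumulative weights S_k
    h = p * (S[end] - ws[1]) + ws[1]       # target cumulative weight
    j = findfirst(>(h), S)                 # index k+1 with S_{k+1} > h
    j === nothing && return float(vs[end]) # p = 1: h equals the total weight
    k = j - 1                              # k >= 1 because S_1 = w_1 <= h
    γ = (h - S[k]) / (S[j] - S[k])
    return vs[k] + γ * (vs[j] - vs[k])
end

v = [3.0, 1.0, 2.0, 4.0]
unit   = weighted_quantile_sketch(v, ones(4), 0.5)    # unit weights
plain  = quantile(v, 0.5)                             # unweighted reference
skewed = weighted_quantile_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 1.0], 0.5)
```

With unit weights the sketch reduces to the unweighted quantile, which is the special case mentioned in the text.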

Base.median (Method)
median(v::RealVector, w::AbstractWeights)

Compute the weighted median of v, using weights given by a weight vector w (of type AbstractWeights). The weight and data vectors must have the same length.

The weighted median $v_k$ is the element of v that satisfies $\sum_{v_i < v_k} w_i \le \frac{1}{2} \sum_{j} w_j$ and $\sum_{v_i > v_k} w_i \le \frac{1}{2} \sum_{j} w_j$.

If a weight has value zero, then its associated data point is ignored. If none of the weights are positive, an error is thrown. NaN is returned if v contains any NaN values. An error is raised if w contains any NaN values.
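
The two inequalities that define the weighted median can be checked directly. The sketch below searches for an element satisfying both, assuming positive weights and no NaN values; `weighted_median_sketch` is a hypothetical helper, and when several elements qualify it returns the smallest, which may differ from how the package resolves such ties.

```julia
# Hypothetical helper: first element (in increasing order) whose lower and
# upper weight sums are both at most half of the total weight.
function weighted_median_sketch(v, w)
    half = sum(w) / 2
    for vk in sort(v)
        below = sum(w[v .< vk])            # weight strictly below vk
        above = sum(w[v .> vk])            # weight strictly above vk
        below <= half && above <= half && return vk
    end
    error("no element satisfies the weighted-median conditions")
end

v = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 5.0]
wm = weighted_median_sketch(v, w)          # the heavy last point dominates
```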


Mode and Modes

StatsBase.mode (Function)
mode(a, [r])

Return the mode (most common number) of an array, optionally over a specified range r. If several modes exist, the first one (in order of appearance) is returned.

StatsBase.modes (Function)
modes(a, [r])::Vector

Return all modes (most common numbers) of an array, optionally over a specified range r.
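
The difference between a single mode and all modes is easiest to see on data with a tie. The following sketch counts occurrences with a dictionary; `modes_sketch` is a hypothetical helper, and returning the modes in order of first appearance is a choice made here rather than a documented property of modes.

```julia
# Hypothetical helper: all most-frequent values, in order of first appearance.
function modes_sketch(a)
    counts = Dict{eltype(a), Int}()
    for x in a
        counts[x] = get(counts, x, 0) + 1
    end
    top = maximum(values(counts))                       # highest frequency
    return unique(filter(x -> counts[x] == top, a))
end

a = [3, 1, 3, 2, 1]
all_modes  = modes_sketch(a)       # 3 and 1 both occur twice
first_mode = first(all_modes)      # the mode that appears first in a
```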


Summary Statistics

summarystats(a)

Compute summary statistics for a real-valued array a. Returns a SummaryStats object containing the mean, minimum, 25th percentile, median, 75th percentile, and maximum.
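
The six quantities listed above can be assembled by hand to see what such a summary contains. This sketch collects them in a named tuple using the Statistics standard library; the field names are illustrative and are not claimed to match the fields of SummaryStats.

```julia
using Statistics

a = [2.0, 4.0, 4.0, 5.0, 10.0]

stats = (mean   = mean(a),
         min    = minimum(a),
         q25    = quantile(a, 0.25),
         median = median(a),
         q75    = quantile(a, 0.75),
         max    = maximum(a))
```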

StatsBase.describe (Function)
describe(a)

Pretty-print the summary statistics provided by summarystats: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.
