Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

Weighted sum and mean

Base.sum — Function

sum(v::AbstractArray, w::AbstractWeights{<:Real}; [dims])

Compute the weighted sum of an array v with weights w, optionally over the dimension dims.

source

Base.sum! — Function

sum!(R::AbstractArray, A::AbstractArray,
     w::AbstractWeights{<:Real}, dim::Int;
     init::Bool=true)

Compute the weighted sum of A with weights w over the dimension dim and store the result in R. If init=false, the sum is added to R rather than starting from zero.

source

StatsBase.wsum — Function

wsum(v, w::AbstractVector, [dim])

Compute the weighted sum of an array v with weights w, optionally over the dimension dim.

source

StatsBase.wsum! — Function

wsum!(R::AbstractArray, A::AbstractArray,
      w::AbstractVector, dim::Int;
      init::Bool=true)

Compute the weighted sum of A with weights w over the dimension dim and store the result in R. If init=false, the sum is added to R rather than starting from zero.

source

Statistics.mean — Function

mean(A::AbstractArray, w::AbstractWeights[, dims::Int])

Compute the weighted mean of array A with weight vector w (of type AbstractWeights). If dim is provided, compute the weighted mean along dimension dims.

Examples

n = 20
x = rand(n)
w = rand(n)
mean(x, weights(w))

source

Statistics.mean! — Function

mean!(R::AbstractArray, A::AbstractArray, w::AbstractWeights[; dims=nothing])

Compute the weighted mean of array A with weight vector w (of type AbstractWeights) along dimension dims, and write results to R.

source

Means

The package provides functions to compute means of different kinds.

StatsBase.geomean — Function

geomean(a)

Return the geometric mean of a collection.

source

StatsBase.harmmean — Function

harmmean(a)

Return the harmonic mean of a collection.

source

StatsBase.genmean — Function

genmean(a, p)

Return the generalized/power mean with exponent p of a real-valued array, i.e. $\left( \frac{1}{n} \sum_{i=1}^n a_i^p \right)^{\frac{1}{p}}$, where n = length(a). It is taken to be the geometric mean when p == 0.

source

Moments and cumulants

Statistics.var — Function

var(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the variance of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
FrequencyWeights: $\frac{1}{\sum{w} - 1}$
ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
Weights: ArgumentError (bias correction not supported)

source

var(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the variance of the vector x using the estimator ce.

source

Statistics.std — Function

std(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the standard deviation of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
FrequencyWeights: $\frac{1}{\sum{w} - 1}$
ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
Weights: ArgumentError (bias correction not supported)

source

std(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the standard deviation of the vector x using the estimator ce.

source

StatsBase.mean_and_var — Function

mean_and_var(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, var)

Return the mean and variance of collection x. If x is an AbstractArray, dim can be specified as a tuple to compute statistics over these dimensions. A weighting vector w can be specified to weight the estimates. Finally, bias correction is be applied to the variance calculation if corrected=true. See var documentation for more details.

source

StatsBase.mean_and_std — Function

mean_and_std(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, std)

Return the mean and standard deviation of collection x. If x is an AbstractArray, dim can be specified as a tuple to compute statistics over these dimensions. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if corrected=true. See std documentation for more details.

source

StatsBase.skewness — Function

skewness(v, [wv::AbstractWeights], m=mean(v))

Compute the standardized skewness of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source

StatsBase.kurtosis — Function

kurtosis(v, [wv::AbstractWeights], m=mean(v))

Compute the excess kurtosis of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source

StatsBase.moment — Function

moment(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order central moment of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source

StatsBase.cumulant — Function

cumulant(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order cumulant of a real-valued array v, optionally specifying a weighting vector wv and a pre-computed mean m.

If k is a range of Integers, then return all the cumulants of orders in this range as a vector.

This quantity is calculated using a recursive definition on lower-order cumulants and central moments.

Reference: Smith, P. J. 1995. A Recursive Formulation of the Old Problem of Obtaining Moments from Cumulants and Vice Versa. The American Statistician, 49(2), 217–218. https://doi.org/10.2307/2684642

source

Measurements of Variation

StatsBase.span — Function

span(x)

Return the span of a collection, i.e. the range minimum(x):maximum(x). The minimum and maximum of x are computed in one pass using extrema.

source

StatsBase.variation — Function

variation(x, m=mean(x); corrected=true)

Return the coefficient of variation of collection x, optionally specifying a precomputed mean m, and the optional correction parameter corrected. The coefficient of variation is the ratio of the standard deviation to the mean. If corrected is false, then std is calculated with denominator n. Else, the std is calculated with denominator n-1.

source

StatsBase.sem — Function

sem(x; mean=nothing)
sem(x::AbstractArray[, weights::AbstractWeights]; mean=nothing)

Return the standard error of the mean for a collection x. A pre-computed mean may be provided.

When not using weights, this is the (sample) standard deviation divided by the sample size. If weights are used, the variance of the sample mean is calculated as follows:

AnalyticWeights: Not implemented.
FrequencyWeights: $\frac{\sum_{i=1}^n w_i (x_i - \bar{x_i})^2}{(\sum w_i) (\sum w_i - 1)}$
ProbabilityWeights: $\frac{n}{n-1} \frac{\sum_{i=1}^n w_i^2 (x_i - \bar{x_i})^2}{\left( \sum w_i \right)^2}$

The standard error is then the square root of the above quantities.

References

Carl-Erik Särndal, Bengt Swensson, Jan Wretman (1992). Model Assisted Survey Sampling. New York: Springer. pp. 51-53.

source

StatsBase.mad — Function

mad(x; center=median(x), normalize=true)

Compute the median absolute deviation (MAD) of collection x around center (by default, around the median).

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

source

StatsBase.mad! — Function

StatsBase.mad!(x; center=median!(x), normalize=true)

Compute the median absolute deviation (MAD) of array x around center (by default, around the median), overwriting x in the process.

source

Z-scores

StatsBase.zscore — Function

zscore(X, [μ, σ])

Compute the z-scores of X, optionally specifying a precomputed mean μ and standard deviation σ. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

μ and σ should be both scalars or both arrays. The computation is broadcasting. In particular, when μ and σ are arrays, they should have the same size, and size(μ, i) == 1 || size(μ, i) == size(X, i) for each dimension.

source

StatsBase.zscore! — Function

zscore!([Z], X, μ, σ)

Compute the z-scores of an array X with mean μ and standard deviation σ. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

If a destination array Z is provided, the scores are stored in Z and it must have the same shape as X. Otherwise X is overwritten.

source

StatsBase.entropy — Function

entropy(p, [b])

Compute the entropy of a collection of probabilities p, optionally specifying a real number b such that the entropy is scaled by 1/log(b). Elements with probability 0 or 1 add 0 to the entropy.

source

StatsBase.renyientropy — Function

renyientropy(p, α)

Compute the Rényi (generalized) entropy of order α of an array p.

source

StatsBase.crossentropy — Function

crossentropy(p, q, [b])

Compute the cross entropy between p and q, optionally specifying a real number b such that the result is scaled by 1/log(b).

source

StatsBase.kldivergence — Function

kldivergence(p, q, [b])

Compute the Kullback-Leibler divergence from q to p, also called the relative entropy of p with respect to q, that is the sum pᵢ * log(pᵢ / qᵢ). Optionally a real number b can be specified such that the divergence is scaled by 1/log(b).

source

StatsBase.percentile — Function

percentile(x, p)

Return the pth percentile of a collection x, i.e. quantile(x, p / 100).

source

StatsBase.iqr — Function

iqr(x)

Compute the interquartile range (IQR) of collection x, i.e. the 75th percentile minus the 25th percentile.

source

StatsBase.nquantile — Function

nquantile(x, n::Integer)

Return the n-quantiles of collection x, i.e. the values which partition v into n subsets of nearly equal size.

Equivalent to quantile(x, [0:n]/n). For example, nquantiles(x, 5) returns a vector of quantiles, respectively at [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].

source

Statistics.quantile — Function

quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector v at a specified set of probability values p, using weights given by a weight vector w (of type AbstractWeights). Weights must not be negative. The weights and data vectors must have the same length. NaN is returned if x contains any NaN values. An error is raised if w contains any NaN values.

With FrequencyWeights, the function returns the same result as quantile for a vector with repeated values. Weights must be integers.

With non FrequencyWeights, denote $N$ the length of the vector, $w$ the vector of weights, $h = p (\sum_{i<= N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$ and $S_k = \sum_{i<=k} w_i$ the cumulative weight for each observation, define $v_{k+1}$ the smallest element of v such that $S_{k+1}$ is strictly superior to $h$. The weighted $p$ quantile is given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when all weights are equal, the function returns the same result as the unweighted quantile.

source

Statistics.median — Method

median(v::AbstractVector{<:Real}, w::AbstractWeights)

Compute the weighted median of v with weights w (of type AbstractWeights). See the documentation for quantile for more details.

source

StatsBase.quantilerank — Function

quantilerank(itr, value; method=:inc)

Compute the quantile position in the [0, 1] interval of value relative to collection itr.

Different definitions can be chosen via the method keyword argument. Let count_less be the number of elements of itr that are less than value, count_equal the number of elements of itr that are equal to value, n the length of itr, greatest_smaller the highest value below value and smallest_greater the lowest value above value. Then method supports the following definitions:

:inc (default): Return a value in the range 0 to 1 inclusive.

Return count_less / (n - 1) if value ∈ itr, otherwise apply interpolation based on definition 7 of quantile in Hyndman and Fan (1996) (equivalent to Excel PERCENTRANK and PERCENTRANK.INC). This definition corresponds to the lower semi-continuous inverse of quantile with its default parameters.

:exc: Return a value in the range 0 to 1 exclusive.

Return (count_less + 1) / (n + 1) if value ∈ itr otherwise apply interpolation based on definition 6 of quantile in Hyndman and Fan (1996) (equivalent to Excel PERCENTRANK.EXC).

:compete: Return count_less / (n - 1) if value ∈ itr, otherwise

return (count_less - 1) / (n - 1), without interpolation (equivalent to MariaDB PERCENT_RANK, dplyr percent_rank).

:tied: Return (count_less + count_equal/2) / n, without interpolation.

Based on the definition in Roscoe, J. T. (1975) (equivalent to "mean" kind of SciPy percentileofscore).

:strict: Return count_less / n, without interpolation

(equivalent to "strict" kind of SciPy percentileofscore).

:weak: Return (count_less + count_equal) / n, without interpolation

(equivalent to "weak" kind of SciPy percentileofscore).

Note

An ArgumentError is thrown if itr contains NaN or missing values or if itr contains fewer than two elements.

References

Roscoe, J. T. (1975). Fundamental Research Statistics for the Behavioral Sciences", 2nd ed., New York : Holt, Rinehart and Winston.

Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages", The American Statistician, Vol. 50, No. 4, pp. 361-365.

Examples

julia> using StatsBase

julia> v1 = [1, 1, 1, 2, 3, 4, 8, 11, 12, 13];

julia> v2 = [1, 2, 3, 5, 6, missing, 8];

julia> v3 = [1, 2, 3, 4, 4, 5, 6, 7, 8, 9];

julia> quantilerank(v1, 2)
0.3333333333333333

julia> quantilerank(v1, 2, method=:exc), quantilerank(v1, 2, method=:tied)
(0.36363636363636365, 0.35)

# use `skipmissing` for vectors with missing entries.
julia> quantilerank(skipmissing(v2), 4)
0.5

# use broadcasting with `Ref` to compute quantile rank for multiple values
julia> quantilerank.(Ref(v3), [4, 8])
2-element Vector{Float64}:
 0.3333333333333333
 0.8888888888888888

source

StatsBase.percentilerank — Function

percentilerank(itr, value; method=:inc)

Return the qth percentile of value in collection itr, i.e. quantilerank(itr, value) * 100.

See the quantilerank docstring for more details.

source

Mode and Modes

StatsBase.mode — Function

mode(a, [r])
mode(a::AbstractArray, wv::AbstractWeights)

Return the mode (most common number) of an array, optionally over a specified range r or weighted via a vector wv. If several modes exist, the first one (in order of appearance) is returned.

source

StatsBase.modes — Function

modes(a, [r])::Vector
mode(a::AbstractArray, wv::AbstractWeights)::Vector

Return all modes (most common numbers) of an array, optionally over a specified range r or weighted via vector wv.

source

Summary Statistics

StatsBase.summarystats — Function

summarystats(a)

Compute summary statistics for a real-valued array a. Returns a SummaryStats object containing the number of observations, number of missing observations, standard deviation, mean, minimum, 25th percentile, median, 75th percentile, and maximum.

source

DataAPI.describe — Function

describe(a)

Pretty-print the summary statistics provided by summarystats: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

source

Reliability Measures

StatsBase.cronbachalpha — Function

cronbachalpha(covmatrix::AbstractMatrix{<:Real})

Calculate Cronbach's alpha (1951) from a covariance matrix covmatrix according to the formula:

\[\rho = \frac{k}{k-1} (1 - \frac{\sum^k_{i=1} \sigma^2_i}{\sum_{i=1}^k \sum_{j=1}^k \sigma_{ij}})\]

where $k$ is the number of items, i.e. columns, $\sigma_i^2$ the item variance, and $\sigma_{ij}$ the inter-item covariance.

Returns a CronbachAlpha object that holds:

alpha: the Cronbach's alpha score for all items, i.e. columns, in covmatrix; and
dropped: a vector giving Cronbach's alpha scores if a specific item, i.e. column, is dropped from covmatrix.

Example

julia> using StatsBase

julia> cov_X = [10 6 6 6;
                6 11 6 6;
                6 6 12 6;
                6 6 6 13];

julia> cronbachalpha(cov_X)
Cronbach's alpha for all items: 0.8136

Cronbach's alpha if an item is dropped:
item 1: 0.7500
item 2: 0.7606
item 3: 0.7714
item 4: 0.7826

source