# Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

## Weighted sum and mean

`Base.sum`

— Function`sum(v::AbstractArray, w::AbstractWeights{<:Real}; [dims])`

Compute the weighted sum of an array `v`

with weights `w`

, optionally over the dimension `dims`

.

`Base.sum!`

— Function```
sum!(R::AbstractArray, A::AbstractArray,
w::AbstractWeights{<:Real}, dim::Int;
init::Bool=true)
```

Compute the weighted sum of `A`

with weights `w`

over the dimension `dim`

and store the result in `R`

. If `init=false`

, the sum is added to `R`

rather than starting from zero.

`StatsBase.wsum`

— Function`wsum(v, w::AbstractVector, [dim])`

Compute the weighted sum of an array `v`

with weights `w`

, optionally over the dimension `dim`

.

`StatsBase.wsum!`

— Function```
wsum!(R::AbstractArray, A::AbstractArray,
w::AbstractVector, dim::Int;
init::Bool=true)
```

Compute the weighted sum of `A`

with weights `w`

over the dimension `dim`

and store the result in `R`

. If `init=false`

, the sum is added to `R`

rather than starting from zero.

`Statistics.mean`

— Function`mean(A::AbstractArray, w::AbstractWeights[, dims::Int])`

Compute the weighted mean of array `A`

with weight vector `w`

(of type `AbstractWeights`

). If `dim`

is provided, compute the weighted mean along dimension `dims`

.

**Examples**

```
n = 20
x = rand(n)
w = rand(n)
mean(x, weights(w))
```

`Statistics.mean!`

— Function`mean!(R::AbstractArray, A::AbstractArray, w::AbstractWeights[; dims=nothing])`

Compute the weighted mean of array `A`

with weight vector `w`

(of type `AbstractWeights`

) along dimension `dims`

, and write results to `R`

.

## Means

The package provides functions to compute means of different kinds.

`StatsBase.geomean`

— Function`geomean(a)`

Return the geometric mean of a collection.

`StatsBase.harmmean`

— Function`harmmean(a)`

Return the harmonic mean of a collection.

`StatsBase.genmean`

— Function`genmean(a, p)`

Return the generalized/power mean with exponent `p`

of a real-valued array, i.e. $\left( \frac{1}{n} \sum_{i=1}^n a_i^p \right)^{\frac{1}{p}}$, where `n = length(a)`

. It is taken to be the geometric mean when `p == 0`

.

## Moments and cumulants

`Statistics.var`

— Function`var(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)`

Compute the variance of a real-valued array `x`

, optionally over a dimension `dim`

. Observations in `x`

are weighted using weight vector `w`

. The uncorrected (when `corrected=false`

) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when `corrected=true`

) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

`AnalyticWeights`

: $\frac{1}{\sum w - \sum {w^2} / \sum w}$`FrequencyWeights`

: $\frac{1}{\sum{w} - 1}$`ProbabilityWeights`

: $\frac{n}{(n - 1) \sum w}$ where $n$ equals`count(!iszero, w)`

`Weights`

:`ArgumentError`

(bias correction not supported)

`var(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)`

Compute the variance of the vector `x`

using the estimator `ce`

.

`Statistics.std`

— Function`std(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)`

Compute the standard deviation of a real-valued array `x`

, optionally over a dimension `dim`

. Observations in `x`

are weighted using weight vector `w`

. The uncorrected (when `corrected=false`

) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when `corrected=true`

) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

`AnalyticWeights`

: $\frac{1}{\sum w - \sum {w^2} / \sum w}$`FrequencyWeights`

: $\frac{1}{\sum{w} - 1}$`ProbabilityWeights`

: $\frac{n}{(n - 1) \sum w}$ where $n$ equals`count(!iszero, w)`

`Weights`

:`ArgumentError`

(bias correction not supported)

`std(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)`

Compute the standard deviation of the vector `x`

using the estimator `ce`

.

`StatsBase.mean_and_var`

— Function`mean_and_var(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, var)`

Return the mean and variance of collection `x`

. If `x`

is an `AbstractArray`

, `dim`

can be specified as a tuple to compute statistics over these dimensions. A weighting vector `w`

can be specified to weight the estimates. Finally, bias correction is be applied to the variance calculation if `corrected=true`

. See `var`

documentation for more details.

`StatsBase.mean_and_std`

— Function`mean_and_std(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, std)`

Return the mean and standard deviation of collection `x`

. If `x`

is an `AbstractArray`

, `dim`

can be specified as a tuple to compute statistics over these dimensions. A weighting vector `w`

can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if `corrected=true`

. See `std`

documentation for more details.

`StatsBase.skewness`

— Function`skewness(v, [wv::AbstractWeights], m=mean(v))`

Compute the standardized skewness of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a center `m`

.

`StatsBase.kurtosis`

— Function`kurtosis(v, [wv::AbstractWeights], m=mean(v))`

Compute the excess kurtosis of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a center `m`

.

`StatsBase.moment`

— Function`moment(v, k, [wv::AbstractWeights], m=mean(v))`

Return the `k`

th order central moment of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a center `m`

.

`StatsBase.cumulant`

— Function`cumulant(v, k, [wv::AbstractWeights], m=mean(v))`

Return the `k`

th order cumulant of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a pre-computed mean `m`

.

If `k`

is a range of `Integer`

s, then return all the cumulants of orders in this range as a vector.

This quantity is calculated using a recursive definition on lower-order cumulants and central moments.

Reference: Smith, P. J. 1995. A Recursive Formulation of the Old Problem of Obtaining Moments from Cumulants and Vice Versa. The American Statistician, 49(2), 217–218. https://doi.org/10.2307/2684642

## Measurements of Variation

`StatsBase.span`

— Function`span(x)`

Return the span of a collection, i.e. the range `minimum(x):maximum(x)`

. The minimum and maximum of `x`

are computed in one pass using `extrema`

.

`StatsBase.variation`

— Function`variation(x, m=mean(x); corrected=true)`

Return the coefficient of variation of collection `x`

, optionally specifying a precomputed mean `m`

, and the optional correction parameter `corrected`

. The coefficient of variation is the ratio of the standard deviation to the mean. If `corrected`

is `false`

, then `std`

is calculated with denominator `n`

. Else, the `std`

is calculated with denominator `n-1`

.

`StatsBase.sem`

— Function```
sem(x; mean=nothing)
sem(x::AbstractArray[, weights::AbstractWeights]; mean=nothing)
```

Return the standard error of the mean for a collection `x`

. A pre-computed `mean`

may be provided.

When not using weights, this is the (sample) standard deviation divided by the sample size. If weights are used, the variance of the sample mean is calculated as follows:

`AnalyticWeights`

: Not implemented.`FrequencyWeights`

: $\frac{\sum_{i=1}^n w_i (x_i - \bar{x_i})^2}{(\sum w_i) (\sum w_i - 1)}$`ProbabilityWeights`

: $\frac{n}{n-1} \frac{\sum_{i=1}^n w_i^2 (x_i - \bar{x_i})^2}{\left( \sum w_i \right)^2}$

The standard error is then the square root of the above quantities.

**References**

Carl-Erik Särndal, Bengt Swensson, Jan Wretman (1992). Model Assisted Survey Sampling. New York: Springer. pp. 51-53.

`StatsBase.mad`

— Function`mad(x; center=median(x), normalize=true)`

Compute the median absolute deviation (MAD) of collection `x`

around `center`

(by default, around the median).

If `normalize`

is set to `true`

, the MAD is multiplied by `1 / quantile(Normal(), 3/4) ≈ 1.4826`

, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

`StatsBase.mad!`

— Function`StatsBase.mad!(x; center=median!(x), normalize=true)`

Compute the median absolute deviation (MAD) of array `x`

around `center`

(by default, around the median), overwriting `x`

in the process.

If `normalize`

is set to `true`

, the MAD is multiplied by `1 / quantile(Normal(), 3/4) ≈ 1.4826`

, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

## Z-scores

`StatsBase.zscore`

— Function`zscore(X, [μ, σ])`

Compute the z-scores of `X`

, optionally specifying a precomputed mean `μ`

and standard deviation `σ`

. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

`μ`

and `σ`

should be both scalars or both arrays. The computation is broadcasting. In particular, when `μ`

and `σ`

are arrays, they should have the same size, and `size(μ, i) == 1 || size(μ, i) == size(X, i)`

for each dimension.

`StatsBase.zscore!`

— Function`zscore!([Z], X, μ, σ)`

Compute the z-scores of an array `X`

with mean `μ`

and standard deviation `σ`

. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

If a destination array `Z`

is provided, the scores are stored in `Z`

and it must have the same shape as `X`

. Otherwise `X`

is overwritten.

## Entropy and Related Functions

`StatsBase.entropy`

— Function`entropy(p, [b])`

Compute the entropy of a collection of probabilities `p`

, optionally specifying a real number `b`

such that the entropy is scaled by `1/log(b)`

. Elements with probability 0 or 1 add 0 to the entropy.

`StatsBase.renyientropy`

— Function`renyientropy(p, α)`

Compute the Rényi (generalized) entropy of order `α`

of an array `p`

.

`StatsBase.crossentropy`

— Function`crossentropy(p, q, [b])`

Compute the cross entropy between `p`

and `q`

, optionally specifying a real number `b`

such that the result is scaled by `1/log(b)`

.

`StatsBase.kldivergence`

— Function`kldivergence(p, q, [b])`

Compute the Kullback-Leibler divergence from `q`

to `p`

, also called the relative entropy of `p`

with respect to `q`

, that is the sum `pᵢ * log(pᵢ / qᵢ)`

. Optionally a real number `b`

can be specified such that the divergence is scaled by `1/log(b)`

.

## Quantile and Related Functions

`StatsBase.percentile`

— Function`percentile(x, p)`

Return the `p`

th percentile of a collection `x`

, i.e. `quantile(x, p / 100)`

.

`StatsBase.iqr`

— Function`iqr(x)`

Compute the interquartile range (IQR) of collection `x`

, i.e. the 75th percentile minus the 25th percentile.

`StatsBase.nquantile`

— Function`nquantile(x, n::Integer)`

Return the n-quantiles of collection `x`

, i.e. the values which partition `v`

into `n`

subsets of nearly equal size.

Equivalent to `quantile(x, [0:n]/n)`

. For example, `nquantiles(x, 5)`

returns a vector of quantiles, respectively at `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]`

.

`Statistics.quantile`

— Function`quantile(v, w::AbstractWeights, p)`

Compute the weighted quantiles of a vector `v`

at a specified set of probability values `p`

, using weights given by a weight vector `w`

(of type `AbstractWeights`

). Weights must not be negative. The weights and data vectors must have the same length. `NaN`

is returned if `x`

contains any `NaN`

values. An error is raised if `w`

contains any `NaN`

values.

With `FrequencyWeights`

, the function returns the same result as `quantile`

for a vector with repeated values. Weights must be integers.

With non `FrequencyWeights`

, denote $N$ the length of the vector, $w$ the vector of weights, $h = p (\sum_{i<= N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$ and $S_k = \sum_{i<=k} w_i$ the cumulative weight for each observation, define $v_{k+1}$ the smallest element of `v`

such that $S_{k+1}$ is strictly superior to $h$. The weighted $p$ quantile is given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when all weights are equal, the function returns the same result as the unweighted `quantile`

.

`Statistics.median`

— Method`median(v::AbstractVector{<:Real}, w::AbstractWeights)`

Compute the weighted median of `v`

with weights `w`

(of type `AbstractWeights`

). See the documentation for `quantile`

for more details.

`StatsBase.quantilerank`

— Function`quantilerank(itr, value; method=:inc)`

Compute the quantile position in the [0, 1] interval of `value`

relative to collection `itr`

.

Different definitions can be chosen via the `method`

keyword argument. Let `count_less`

be the number of elements of `itr`

that are less than `value`

, `count_equal`

the number of elements of `itr`

that are equal to `value`

, `n`

the length of `itr`

, `greatest_smaller`

the highest value below `value`

and `smallest_greater`

the lowest value above `value`

. Then `method`

supports the following definitions:

`:inc`

(default): Return a value in the range 0 to 1 inclusive.

Return `count_less / (n - 1)`

if `value ∈ itr`

, otherwise apply interpolation based on definition 7 of quantile in Hyndman and Fan (1996) (equivalent to Excel `PERCENTRANK`

and `PERCENTRANK.INC`

). This definition corresponds to the lower semi-continuous inverse of `quantile`

with its default parameters.

`:exc`

: Return a value in the range 0 to 1 exclusive.

Return `(count_less + 1) / (n + 1)`

if `value ∈ itr`

otherwise apply interpolation based on definition 6 of quantile in Hyndman and Fan (1996) (equivalent to Excel `PERCENTRANK.EXC`

).

`:compete`

: Return`count_less / (n - 1)`

if`value ∈ itr`

, otherwise

return `(count_less - 1) / (n - 1)`

, without interpolation (equivalent to MariaDB `PERCENT_RANK`

, dplyr `percent_rank`

).

`:tied`

: Return`(count_less + count_equal/2) / n`

, without interpolation.

Based on the definition in Roscoe, J. T. (1975) (equivalent to `"mean"`

kind of SciPy `percentileofscore`

).

`:strict`

: Return`count_less / n`

, without interpolation

(equivalent to `"strict"`

kind of SciPy `percentileofscore`

).

`:weak`

: Return`(count_less + count_equal) / n`

, without interpolation

(equivalent to `"weak"`

kind of SciPy `percentileofscore`

).

An `ArgumentError`

is thrown if `itr`

contains `NaN`

or `missing`

values or if `itr`

contains fewer than two elements.

**References**

Roscoe, J. T. (1975). Fundamental Research Statistics for the Behavioral Sciences", 2nd ed., New York : Holt, Rinehart and Winston.

Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages", *The American Statistician*, Vol. 50, No. 4, pp. 361-365.

**Examples**

```
julia> using StatsBase
julia> v1 = [1, 1, 1, 2, 3, 4, 8, 11, 12, 13];
julia> v2 = [1, 2, 3, 5, 6, missing, 8];
julia> v3 = [1, 2, 3, 4, 4, 5, 6, 7, 8, 9];
julia> quantilerank(v1, 2)
0.3333333333333333
julia> quantilerank(v1, 2, method=:exc), quantilerank(v1, 2, method=:tied)
(0.36363636363636365, 0.35)
# use `skipmissing` for vectors with missing entries.
julia> quantilerank(skipmissing(v2), 4)
0.5
# use broadcasting with `Ref` to compute quantile rank for multiple values
julia> quantilerank.(Ref(v3), [4, 8])
2-element Vector{Float64}:
0.3333333333333333
0.8888888888888888
```

`StatsBase.percentilerank`

— Function`percentilerank(itr, value; method=:inc)`

Return the `q`

th percentile of `value`

in collection `itr`

, i.e. `quantilerank(itr, value)`

* 100.

See the `quantilerank`

docstring for more details.

## Mode and Modes

`StatsBase.mode`

— Function```
mode(a, [r])
mode(a::AbstractArray, wv::AbstractWeights)
```

Return the mode (most common number) of an array, optionally over a specified range `r`

or weighted via a vector `wv`

. If several modes exist, the first one (in order of appearance) is returned.

`StatsBase.modes`

— Function```
modes(a, [r])::Vector
mode(a::AbstractArray, wv::AbstractWeights)::Vector
```

Return all modes (most common numbers) of an array, optionally over a specified range `r`

or weighted via vector `wv`

.

## Summary Statistics

`StatsBase.summarystats`

— Function`summarystats(a)`

Compute summary statistics for a real-valued array `a`

. Returns a `SummaryStats`

object containing the number of observations, number of missing observations, standard deviation, mean, minimum, 25th percentile, median, 75th percentile, and maximum.

`DataAPI.describe`

— Function`describe(a)`

Pretty-print the summary statistics provided by `summarystats`

: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

## Reliability Measures

`StatsBase.cronbachalpha`

— Function`cronbachalpha(covmatrix::AbstractMatrix{<:Real})`

Calculate Cronbach's alpha (1951) from a covariance matrix `covmatrix`

according to the formula:

\[\rho = \frac{k}{k-1} (1 - \frac{\sum^k_{i=1} \sigma^2_i}{\sum_{i=1}^k \sum_{j=1}^k \sigma_{ij}})\]

where $k$ is the number of items, i.e. columns, $\sigma_i^2$ the item variance, and $\sigma_{ij}$ the inter-item covariance.

Returns a `CronbachAlpha`

object that holds:

`alpha`

: the Cronbach's alpha score for all items, i.e. columns, in`covmatrix`

; and`dropped`

: a vector giving Cronbach's alpha scores if a specific item, i.e. column, is dropped from`covmatrix`

.

**Example**

```
julia> using StatsBase
julia> cov_X = [10 6 6 6;
6 11 6 6;
6 6 12 6;
6 6 6 13];
julia> cronbachalpha(cov_X)
Cronbach's alpha for all items: 0.8136
Cronbach's alpha if an item is dropped:
item 1: 0.7500
item 2: 0.7606
item 3: 0.7714
item 4: 0.7826
```