Sampling from Population

# Sampling from Population

## Sampling API

The package provides functions for sampling from a given population (with or without replacement).

``sample([rng], a, [wv::AbstractWeights])``

Select a single random element of `a`. Sampling probabilities are proportional to the weights given in `wv`, if provided.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source
``sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)``

Select a random, optionally weighted sample of size `n` from an array `a` using a polyalgorithm. Sampling probabilities are proportional to the weights given in `wv`, if provided. `replace` dictates whether sampling is performed with replacement and `order` dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source
``sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)``

Select a random, optionally weighted sample from an array `a` specifying the dimensions `dims` of the output array. Sampling probabilities are proportional to the weights given in `wv`, if provided. `replace` dictates whether sampling is performed with replacement and `order` dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source
``sample([rng], wv::AbstractWeights)``

Select a single random integer in `1:length(wv)` with probabilities proportional to the weights given in `wv`.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source
``sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)``

Draw a random sample of `length(x)` elements from an array `a` and store the result in `x`. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in `wv`, if provided. `replace` dictates whether sampling is performed with replacement and `order` dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source

## Algorithms

Internally, this package implements multiple algorithms, and the `sample` (and `sample!`) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.

Note that the choices made in `sample` are decided based on extensive benchmarking (see `perf/sampling.jl` and `perf/wsampling.jl`). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.

Here are a list of algorithms implemented in the package. The functions below are not exported (one can still import them from StatsBase via `using` though).

### Notations

• `a`: source array representing the population

• `x`: the destination array

• `wv`: the weight vector (of type `AbstractWeights`), for weighted sampling

• `n`: the length of `a`

• `k`: the length of `x`. For sampling without replacement, `k` must not exceed `n`.

• `rng`: optional random number generator (defaults to `Random.GLOBAL_RNG`)

All following functions write results to `x` (pre-allocated) and return `x`.

### Sampling Algorithms (Non-Weighted)

``direct_sample!([rng], a::AbstractArray, x::AbstractArray)``

Direct sampling: for each `j` in `1:k`, randomly pick `i` from `1:n`, and set `x[j] = a[i]`, with `n=length(a)` and `k=length(x)`.

This algorithm consumes `k` random numbers.

source
``samplepair([rng], n)``

Draw a pair of distinct integers between 1 and `n` without replacement.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source
``samplepair([rng], a)``

Draw a pair of distinct elements from the array `a` without replacement.

Optionally specify a random number generator `rng` as the first argument (defaults to `Random.GLOBAL_RNG`).

source
``knuths_sample!([rng], a, x)``

Knuth's Algorithm S for random sampling without replacement.

Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.

This algorithm consumes `length(a)` random numbers. It requires no additional memory space. Suitable for the case where memory is tight.

source
``fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)``

Fisher-Yates shuffling (with early termination).

Pseudo-code:

``````n = length(a)
k = length(x)

# Create an array of the indices
inds = collect(1:n)

for i = 1:k
# swap element `i` with another random element in inds[i:n]
# set element `i` in `x`
end``````

This algorithm consumes `k=length(x)` random numbers. It uses an integer array of length `n=length(a)` internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when `n` is greater than `k`. It is \$O(n)\$ for initialization, plus \$O(k)\$ for random shuffling

source
``self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)``

Self-avoid sampling: use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.

This algorithm consumes about (or slightly more than) `k=length(x)` random numbers, and requires \$O(k)\$ memory to store the set of sampled indices. Very fast when \$n >> k\$, with `n=length(a)`.

However, if `k` is large and approaches \$n\$, the rejection rate would increase drastically, resulting in poorer performance.

source
``seqsample_a!([rng], a::AbstractArray, x::AbstractArray)``

Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes \$O(n)\$ random numbers, with `n=length(a)`. The outputs are ordered.

source
``seqsample_c!([rng], a::AbstractArray, x::AbstractArray)``

Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes \$O(k^2)\$ random numbers, with `k=length(x)`. The outputs are ordered.

source

### Weighted Sampling Algorithms

``direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)``

Direct sampling.

Draw each sample by scanning the weight vector.

Noting `k=length(x)` and `n=length(a)`, this algorithm:

• consumes `k` random numbers

• has time complexity \$O(n k)\$, as scanning the weight vector each time takes \$O(n)\$

• requires no additional memory space.

source
``alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)``

Alias method.

Build an alias table, and sample therefrom.

Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.

Noting `k=length(x)` and `n=length(a)`, this algorithm takes \$O(n \log n)\$ time for building the alias table, and then \$O(1)\$ to draw each sample. It consumes \$2 k\$ random numbers.

source
``naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)``

Naive implementation of weighted sampling without replacement.

It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.

Noting `k=length(x)` and `n=length(a)`, this algorithm consumes \$O(k)\$ random numbers, and has overall time complexity \$O(n k)\$.

source
``efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)``

Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting `k=length(x)` and `n=length(a)`, this algorithm takes \$O(n + k \log k)\$ processing time to draw \$k\$ elements. It consumes \$n\$ random numbers.

source
``efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)``

Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting `k=length(x)` and `n=length(a)`, this algorithm takes \$O(k \log(k) \log(n / k))\$ processing time to draw \$k\$ elements. It consumes \$n\$ random numbers.

source