Sampling from Population

Sampling API

The package provides functions for sampling from a given population (with or without replacement).

StatsBase.sample — Function

sample([rng], a, [wv::AbstractWeights])

Select a single random element of a. Sampling probabilities are proportional to the weights given in wv, if provided.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)

Select a random, optionally weighted sample of size n from an array a using a polyalgorithm. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)

Select a random, optionally weighted sample from an array a specifying the dimensions dims of the output array. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

sample([rng], wv::AbstractWeights)

Select a single random integer in 1:length(wv) with probabilities proportional to the weights given in wv.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

StatsBase.sample! — Function

sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)

Draw a random sample of length(x) elements from an array a and store the result in x. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

Output array a must not be the same object as x or wv nor share memory with them, or the result may be incorrect.

source

StatsBase.wsample — Function

wsample([rng], [a], w)

Select a weighted random sample of size 1 from a with probabilities proportional to the weights given in w. If a is not present, select a random weight from w.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

wsample([rng], [a], w, n::Integer; replace=true, ordered=false)

Select a weighted random sample of size n from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of size n of the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

wsample([rng], [a], w, dims::Dims; replace=true, ordered=false)

Select a weighted random sample from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of size n of the weights given in w. The dimensions of the output are given by dims.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

StatsBase.wsample! — Function

wsample!([rng], a, w, x; replace=true, ordered=false)

Select a weighted sample from an array a and store the result in x. Sampling probabilities are proportional to the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

Algorithms

Internally, this package implements multiple algorithms, and the sample (and sample!) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.

Note that the choices made in sample are decided based on extensive benchmarking (see perf/sampling.jl and perf/wsampling.jl). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.

Here are a list of algorithms implemented in the package. The functions below are not exported (one can still import them from StatsBase via using though).

Notations

a: source array representing the population
x: the destination array
wv: the weight vector (of type AbstractWeights), for weighted sampling
n: the length of a
k: the length of x. For sampling without replacement, k must not exceed n.
rng: optional random number generator (defaults to Random.GLOBAL_RNG)

All following functions write results to x (pre-allocated) and return x.

Sampling Algorithms (Non-Weighted)

StatsBase.direct_sample! — Method

direct_sample!([rng], a::AbstractArray, x::AbstractArray)

Direct sampling: for each j in 1:k, randomly pick i from 1:n, and set x[j] = a[i], with n=length(a) and k=length(x).

This algorithm consumes k random numbers.

source

StatsBase.samplepair — Function

samplepair([rng], n)

Draw a pair of distinct integers between 1 and n without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

samplepair([rng], a)

Draw a pair of distinct elements from the array a without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

source

StatsBase.knuths_sample! — Function

knuths_sample!([rng], a, x)

Knuth's Algorithm S for random sampling without replacement.

Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.

This algorithm consumes length(a) random numbers. It requires no additional memory space. Suitable for the case where memory is tight.

source

StatsBase.fisher_yates_sample! — Function

fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)

Fisher-Yates shuffling (with early termination).

Pseudo-code:

n = length(a)
k = length(x)

# Create an array of the indices
inds = collect(1:n)

for i = 1:k
    # swap element `i` with another random element in inds[i:n]
    # set element `i` in `x`
end

This algorithm consumes k=length(x) random numbers. It uses an integer array of length n=length(a) internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when n is greater than k. It is $O(n)$ for initialization, plus $O(k)$ for random shuffling

source

StatsBase.self_avoid_sample! — Function

self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)

Self-avoid sampling: use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.

This algorithm consumes about (or slightly more than) k=length(x) random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n >> k$, with n=length(a).

However, if k is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.

source

StatsBase.seqsample_a! — Function

seqsample_a!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(n)$ random numbers, with n=length(a). The outputs are ordered.

source

StatsBase.seqsample_c! — Function

seqsample_c!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k^2)$ random numbers, with k=length(x). The outputs are ordered.

source

StatsBase.seqsample_d! — Function

seqsample_d!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm D described in the following paper (page 716-17): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k)$ random numbers, with k=length(x). The outputs are ordered.

source

Weighted Sampling Algorithms

StatsBase.direct_sample! — Method

direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Direct sampling.

Draw each sample by scanning the weight vector.

Noting k=length(x) and n=length(a), this algorithm:

consumes k random numbers
has time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
requires no additional memory space.

source

StatsBase.alias_sample! — Function

alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Alias method.

Build an alias table, and sample therefrom.

Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.

Noting k=length(x) and n=length(a), this algorithm takes $O(n \log n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $2 k$ random numbers.

source

StatsBase.naive_wsample_norep! — Function

naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Naive implementation of weighted sampling without replacement.

It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.

Noting k=length(x) and n=length(a), this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.

source

StatsBase.efraimidis_a_wsample_norep! — Function

efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.

source

StatsBase.efraimidis_ares_wsample_norep! — Function

efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.

source