Sampling from Population
Sampling API
The package provides functions for sampling from a given population (with or without replacement).
StatsBase.sample — Functionsample([rng], a::AbstractArray, [wv::AbstractWeights])Select a single random element of a. Sampling probabilities are proportional to the weights given in wv, if provided.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)Select a random, optionally weighted sample of size n from an array a using a polyalgorithm. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)Select a random, optionally weighted sample from an array a specifying the dimensions dims of the output array. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
sample([rng], wv::AbstractWeights)Select a single random integer in 1:length(wv) with probabilities proportional to the weights given in wv.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
StatsBase.sample! — Functionsample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)Draw a random sample of length(x) elements from an array a and store the result in x. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
Output array a must not be the same object as x or wv nor share memory with them, or the result may be incorrect.
StatsBase.wsample — Functionwsample([rng], [a], w)Select a weighted random sample of size 1 from a with probabilities proportional to the weights given in w. If a is not present, select a random weight from w.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
wsample([rng], [a], w, n::Integer; replace=true, ordered=false)Select a weighted random sample of size n from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of size n of the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
wsample([rng], [a], w, dims::Dims; replace=true, ordered=false)Select a weighted random sample from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of size n of the weights given in w. The dimensions of the output are given by dims.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
StatsBase.wsample! — Functionwsample!([rng], a, w, x; replace=true, ordered=false)Select a weighted sample from an array a and store the result in x. Sampling probabilities are proportional to the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
Algorithms
Internally, this package implements multiple algorithms, and the sample (and sample!) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.
Note that the choices made in sample are decided based on extensive benchmarking (see perf/sampling.jl and perf/wsampling.jl). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.
Here are a list of algorithms implemented in the package. The functions below are not exported (one can still import them from StatsBase via using though).
Notations
- a: source array representing the population
- x: the destination array
- wv: the weight vector (of type- AbstractWeights), for weighted sampling
- n: the length of- a
- k: the length of- x. For sampling without replacement,- kmust not exceed- n.
- rng: optional random number generator (defaults to- Random.default_rng()on Julia >= 1.3 and- Random.GLOBAL_RNGon Julia < 1.3)
All following functions write results to x (pre-allocated) and return x.
Sampling Algorithms (Non-Weighted)
StatsBase.direct_sample! — Methoddirect_sample!([rng], a::AbstractArray, x::AbstractArray)Direct sampling: for each j in 1:k, randomly pick i from 1:n, and set x[j] = a[i], with n=length(a) and k=length(x).
This algorithm consumes k random numbers.
StatsBase.samplepair — Functionsamplepair([rng], n)Draw a pair of distinct integers between 1 and n without replacement.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
samplepair([rng], a)Draw a pair of distinct elements from the array a without replacement.
Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).
StatsBase.knuths_sample! — Functionknuths_sample!([rng], a, x)Knuth's Algorithm S for random sampling without replacement.
Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.
This algorithm consumes length(a) random numbers. It requires no additional memory space. Suitable for the case where memory is tight.
StatsBase.fisher_yates_sample! — Functionfisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)Fisher-Yates shuffling (with early termination).
Pseudo-code:
n = length(a)
k = length(x)
# Create an array of the indices
inds = collect(1:n)
for i = 1:k
    # swap element `i` with another random element in inds[i:n]
    # set element `i` in `x`
endThis algorithm consumes k=length(x) random numbers. It uses an integer array of length n=length(a) internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when n is greater than k. It is $O(n)$ for initialization, plus $O(k)$ for random shuffling
StatsBase.self_avoid_sample! — Functionself_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)Self-avoid sampling: use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.
This algorithm consumes about (or slightly more than) k=length(x) random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n >> k$, with n=length(a).
However, if k is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.
StatsBase.seqsample_a! — Functionseqsample_a!([rng], a::AbstractArray, x::AbstractArray)Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.
This algorithm consumes $O(n)$ random numbers, with n=length(a). The outputs are ordered.
StatsBase.seqsample_c! — Functionseqsample_c!([rng], a::AbstractArray, x::AbstractArray)Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.
This algorithm consumes $O(k^2)$ random numbers, with k=length(x). The outputs are ordered.
StatsBase.seqsample_d! — Functionseqsample_d!([rng], a::AbstractArray, x::AbstractArray)Random subsequence sampling using algorithm D described in the following paper (page 716-17): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.
This algorithm consumes $O(k)$ random numbers, with k=length(x). The outputs are ordered.
Weighted Sampling Algorithms
StatsBase.direct_sample! — Methoddirect_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)Direct sampling.
Draw each sample by scanning the weight vector.
Noting k=length(x) and n=length(a), this algorithm:
- consumes krandom numbers
- has time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
- requires no additional memory space.
StatsBase.alias_sample! — Functionalias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)Alias method.
Build an alias table, and sample therefrom.
Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.
Noting k=length(x) and n=length(a), this algorithm takes $O(n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $k$ random numbers.
StatsBase.naive_wsample_norep! — Functionnaive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)Naive implementation of weighted sampling without replacement.
It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.
Noting k=length(x) and n=length(a), this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.
StatsBase.efraimidis_a_wsample_norep! — Functionefraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.
Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.
Noting k=length(x) and n=length(a), this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.
StatsBase.efraimidis_ares_wsample_norep! — Functionefraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.
Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.
Noting k=length(x) and n=length(a), this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.