Sampling from Population
Sampling API
The package provides functions for sampling from a given population (with or without replacement).
StatsBase.sample
— Function.sample([rng], a, [wv::AbstractWeights])
Select a single random element of a
. Sampling probabilities are proportional to the weights given in wv
, if provided.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)
Select a random, optionally weighted sample of size n
from an array a
using a polyalgorithm. Sampling probabilities are proportional to the weights given in wv
, if provided. replace
dictates whether sampling is performed with replacement and order
dictates whether an ordered sample, also called a sequential sample, should be taken.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)
Select a random, optionally weighted sample from an array a
specifying the dimensions dims
of the output array. Sampling probabilities are proportional to the weights given in wv
, if provided. replace
dictates whether sampling is performed with replacement and order
dictates whether an ordered sample, also called a sequential sample, should be taken.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
sample([rng], wv::AbstractWeights)
Select a single random integer in 1:length(wv)
with probabilities proportional to the weights given in wv
.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
StatsBase.sample!
— Function.sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)
Draw a random sample of length(x)
elements from an array a
and store the result in x
. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in wv
, if provided. replace
dictates whether sampling is performed with replacement and order
dictates whether an ordered sample, also called a sequential sample, should be taken.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
Algorithms
Internally, this package implements multiple algorithms, and the sample
(and sample!
) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.
Note that the choices made in sample
are decided based on extensive benchmarking (see perf/sampling.jl
and perf/wsampling.jl
). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.
Here are a list of algorithms implemented in the package. The functions below are not exported (one can still import them from StatsBase via using
though).
Notations
a
: source array representing the populationx
: the destination arraywv
: the weight vector (of typeAbstractWeights
), for weighted samplingn
: the length ofa
k
: the length ofx
. For sampling without replacement,k
must not exceedn
.rng
: optional random number generator (defaults toRandom.GLOBAL_RNG
)
All following functions write results to x
(pre-allocated) and return x
.
Sampling Algorithms (Non-Weighted)
StatsBase.direct_sample!
— Method.direct_sample!([rng], a::AbstractArray, x::AbstractArray)
Direct sampling: for each j
in 1:k
, randomly pick i
from 1:n
, and set x[j] = a[i]
, with n=length(a)
and k=length(x)
.
This algorithm consumes k
random numbers.
StatsBase.samplepair
— Function.samplepair([rng], n)
Draw a pair of distinct integers between 1 and n
without replacement.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
samplepair([rng], a)
Draw a pair of distinct elements from the array a
without replacement.
Optionally specify a random number generator rng
as the first argument (defaults to Random.GLOBAL_RNG
).
StatsBase.knuths_sample!
— Function.knuths_sample!([rng], a, x)
Knuth's Algorithm S for random sampling without replacement.
Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.
This algorithm consumes length(a)
random numbers. It requires no additional memory space. Suitable for the case where memory is tight.
StatsBase.fisher_yates_sample!
— Function.fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)
Fisher-Yates shuffling (with early termination).
Pseudo-code:
n = length(a)
k = length(x)
# Create an array of the indices
inds = collect(1:n)
for i = 1:k
# swap element `i` with another random element in inds[i:n]
# set element `i` in `x`
end
This algorithm consumes k=length(x)
random numbers. It uses an integer array of length n=length(a)
internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when n
is greater than k
. It is $O(n)$ for initialization, plus $O(k)$ for random shuffling
StatsBase.self_avoid_sample!
— Function.self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)
Self-avoid sampling: use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.
This algorithm consumes about (or slightly more than) k=length(x)
random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n >> k$, with n=length(a)
.
However, if k
is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.
StatsBase.seqsample_a!
— Function.seqsample_a!([rng], a::AbstractArray, x::AbstractArray)
Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.
This algorithm consumes $O(n)$ random numbers, with n=length(a)
. The outputs are ordered.
StatsBase.seqsample_c!
— Function.seqsample_c!([rng], a::AbstractArray, x::AbstractArray)
Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.
This algorithm consumes $O(k^2)$ random numbers, with k=length(x)
. The outputs are ordered.
Weighted Sampling Algorithms
StatsBase.direct_sample!
— Method.direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)
Direct sampling.
Draw each sample by scanning the weight vector.
Noting k=length(x)
and n=length(a)
, this algorithm:
consumes
k
random numbershas time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
requires no additional memory space.
StatsBase.alias_sample!
— Function.alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)
Alias method.
Build an alias table, and sample therefrom.
Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.
Noting k=length(x)
and n=length(a)
, this algorithm takes $O(n \log n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $2 k$ random numbers.
StatsBase.naive_wsample_norep!
— Function.naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)
Naive implementation of weighted sampling without replacement.
It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.
Noting k=length(x)
and n=length(a)
, this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.
StatsBase.efraimidis_a_wsample_norep!
— Function.efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)
Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.
Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.
Noting k=length(x)
and n=length(a)
, this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.
StatsBase.efraimidis_ares_wsample_norep!
— Function.efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)
Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.
Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.
Noting k=length(x)
and n=length(a)
, this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.