Initialization
A clustering algorithm usually requires initialization before it could be started.
Seeding
Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).
Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm
:
Clustering.SeedingAlgorithm
— TypeSeedingAlgorithm
Base type for all seeding algorithms.
Each seeding algorithm should implement the two functions: initseeds!
and initseeds_by_costs!
.
Clustering.initseeds!
— Functioninitseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
X::AbstractMatrix) -> iseeds
Initialize iseeds
with the indices of cluster seeds for the X
data matrix using the alg
seeding algorithm.
Clustering.initseeds_by_costs!
— Functioninitseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
costs::AbstractMatrix) -> iseeds
Initialize iseeds
with the indices of cluster seeds for the costs
matrix using the alg
seeding algorithm.
Here, costs[i, j]
is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.
There are several seeding methods described in the literature. Clustering.jl implements three popular ones:
Clustering.KmppAlg
— TypeKmppAlg <: SeedingAlgorithm
Kmeans++ seeding (:kmpp
).
Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.
References
D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.
Clustering.KmCentralityAlg
— TypeKmCentralityAlg <: SeedingAlgorithm
K-medoids initialization based on centrality (:kmcen
).
Choose the $k$ points with the highest centrality as seeds.
References
Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039
Clustering.RandSeedAlg
— TypeRandSeedAlg <: SeedingAlgorithm
Random seeding (:rand
).
Chooses an arbitrary subset of $k$ data points as cluster seeds.
In practice, we have found that Kmeans++ is the most effective choice.
For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds
and applying the proper SeedingAlgorithm
:
Clustering.initseeds
— Functioninitseeds(alg::Union{SeedingAlgorithm, Symbol},
X::AbstractMatrix, k::Integer) -> Vector{Int}
Select k
seeds from a $d×n$ data matrix X
using the alg
algorithm.
alg
could be either an instance of SeedingAlgorithm
or a symbolic name of the algorithm.
Returns the vector of k
seed indices.
Clustering.initseeds_by_costs
— Functioninitseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
costs::AbstractMatrix, k::Integer) -> Vector{Int}
Select k
seeds from the $n×n$ costs
matrix using algorithm alg
.
Here, costs[i, j]
is the cost of assigning points i
and
j
` to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.
Returns the vector of k
seed indices.