Basics
The package implements a variety of clustering algorithms:
- K-means
- K-medoids
- Hierarchical Clustering
- MCL (Markov Cluster Algorithm)
- Affinity Propagation
- DBSCAN
- Fuzzy C-means
Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.
Inputs
A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:
- Data matrix $X$ of size $d \times n$, where the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
- Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.
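As a minimal sketch of the two input forms, the snippet below builds a random data matrix with one point per column and derives a Euclidean distance matrix from it using only Base and LinearAlgebra. The random data, the seed, and the choice of 3 clusters are illustrative assumptions. kmeans is shown as an example of a function that takes the data matrix and kmedoids as one that takes the distance matrix.

```julia
using Clustering
using LinearAlgebra: norm
using Random

Random.seed!(1)                      # illustrative seed, only for reproducibility

d, n = 2, 60
X = randn(d, n)                      # data matrix: X[:, i] is the i-th point

# Distance matrix: D[i, j] is the Euclidean distance between points i and j.
D = [norm(X[:, i] - X[:, j]) for i in 1:n, j in 1:n]

km   = kmeans(X, 3)                  # operates on the d × n data matrix
kmed = kmedoids(D, 3)                # operates on the n × n distance matrix
```

Any other pairwise cost can be substituted for the Euclidean distance when building D, as long as the resulting matrix is $n \times n$.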
Common Options
Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:
- maxiter::Integer: maximum number of iterations.
- tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
- display::Symbol: the level of information to be displayed. It may take one of the following values:
  - :none: nothing is shown
  - :final: only shows a brief summary when the algorithm ends
  - :iter: shows the progress at each iteration
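The options above are passed as keyword arguments. The following sketch uses kmeans as one example of an iterative algorithm. The data, the seed, and the specific option values are assumptions chosen for illustration, not recommended settings.

```julia
using Clustering
using Random

Random.seed!(42)                     # illustrative seed
X = randn(2, 200)                    # 200 points in 2-dimensional space

# Stop after at most 50 iterations, or earlier once the objective
# changes by less than 1e-4 between consecutive iterations.
R = kmeans(X, 4; maxiter=50, tol=1e-4, display=:none)

println("iterations: ", R.iterations, ", converged: ", R.converged)
```

Switching display to :final or :iter prints a summary at the end or progress at every iteration, which can help when tuning maxiter and tol.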
Results
A clustering function returns an object (typically an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and information about the clustering algorithm (e.g. the number of iterations and whether it converged).
Clustering.ClusteringResult — Type
ClusteringResult
Base type for the output of a clustering algorithm.
The following generic methods are supported by any subtype of ClusteringResult:
Clustering.nclusters — Method
nclusters(R::ClusteringResult) -> Int
Get the number of clusters.
StatsBase.counts — Method
counts(R::ClusteringResult) -> Vector{Int}
Get the vector of cluster sizes.
counts(R)[k] is the number of points assigned to the $k$-th cluster.
Clustering.wcounts — Method
wcounts(R::ClusteringResult) -> Vector{Float64}
wcounts(R::FuzzyCMeansResult) -> Vector{Float64}
Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.
For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).
Clustering.assignments — Method
assignments(R::ClusteringResult) -> Vector{Int}
Get the vector of cluster indices for each point.
assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.
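To illustrate how the generic methods fit together, here is a small sketch that runs kmeans on random data and queries the result. The data, the seed, and the number of clusters are assumptions made for the example. The same accessor calls should apply to the result of any other algorithm that returns a ClusteringResult subtype.

```julia
using Clustering
using Random

Random.seed!(7)                      # illustrative seed
X = randn(3, 90)                     # 90 points in 3-dimensional space
R = kmeans(X, 3)

k     = nclusters(R)                 # number of clusters
sizes = counts(R)                    # sizes[c] = number of points in cluster c
a     = assignments(R)               # a[i] = cluster index of the i-th point
w     = wcounts(R)                   # weighted sizes; no weights were given here

# The cluster sizes can be recomputed from the assignments.
recomputed = [count(==(c), a) for c in 1:k]
println(sizes == recomputed)
```

Because no point weights were supplied, the weighted sizes in w match the plain sizes in value.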