Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclust — Function

hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
- :single (the default): use the minimum distance between any of the cluster members
- :average: use the mean distance between any of the cluster members
- :complete: use the maximum distance between any of the members
- :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
- :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
- :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
- :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et. al. Bioinformatics (2001)

source

Clustering.Hclust — Type

Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
- each row specifies the left and right subtrees (referenced by their $id$s) that are merged
- negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
- positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: hclust.

source

Single-linkage clustering using distance matrix:

using Clustering
D = rand(1000, 1000);
D += D'; # symmetric distance matrix (optional)
result = hclust(D, linkage=:single)

Hclust{Float64}([-621 -794; -377 -693; … ; -250 997; -837 998], [0.0015153382227456325, 0.0035391802366823155, 0.003931583816750739, 0.005466611686326894, 0.006093272470955369, 0.006650870396967101, 0.006908558004172338, 0.007074058622394297, 0.007220702731105755, 0.007965764581307377  …  0.10085948655771904, 0.10242660051773167, 0.10460159103411504, 0.1046229368662085, 0.10520856227935926, 0.10691049347918513, 0.10980536928061868, 0.11115515857677327, 0.11170173904449576, 0.12078202349902789], [837, 250, 538, 376, 221, 597, 285, 141, 93, 405  …  470, 922, 636, 385, 183, 433, 124, 336, 607, 913], :single)

The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.

Clustering.cutree — Function

cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

k::Integer (optional) the number of desired clusters.
h::Real (optional) the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.