Hierarchical Clustering

# Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The `hclust` function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the `linkage` parameter):

``hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust``

Perform hierarchical clustering using the distance matrix `d` and the cluster `linkage` function.

Returns the dendrogram as a `Hclust` object.

Arguments

• `d::AbstractMatrix`: the pairwise distance matrix. \$d_{ij}\$ is the distance between \$i\$-th and \$j\$-th points.
• `linkage::Symbol`: cluster linkage function to use. `linkage` defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
• `:single` (the default): use the minimum distance between any of the cluster members
• `:average`: use the mean distance between any of the cluster members
• `:complete`: use the maximum distance between any of the members
• `:ward`: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
• `:ward_presquared`: same as `:ward`, but assumes that the distances in `d` are already squared.
• `uplo::Symbol` (optional): specifies whether the upper (`:U`) or the lower (`:L`) triangle of `d` should be used to get the distances. If not specified, the method expects `d` to be symmetric.
• `branchorder::Symbol` (optional): algorithm to order leaves and branches. The valid choices are:
• `:r` (the default): ordering based on the node heights and the original elements order (compatible with R's `hclust`)
• `:barjoseph` (or `:optimal`): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et. al. Bioinformatics (2001)
source
``Hclust{T<:Real}``

The output of `hclust`, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's `hclust` class.

Fields

• `merges::Matrix{Int}`: \$N×2\$ matrix encoding subtree merges:
• each row specifies the left and right subtrees (referenced by their \$id\$s) that are merged
• negative subtree \$id\$ denotes the leaf node and corresponds to the data point at position \$-id\$
• positive \$id\$ denotes nontrivial subtree (the row `merges[id, :]` specifies its left and right subtrees)
• `linkage::Symbol`: the name of cluster linkage function used to construct the hierarchy (see `hclust`)
• `heights::Vector{T}`: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified `linkage`
• `order::Vector{Int}`: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: `hclust`.

source
``````using Clustering
D = rand(1000, 1000);
D += D'; # symmetric distance matrix (optional)
``Hclust{Float64}([-42 -580; -19 -258; … ; -148 997; -614 998], [0.0016173674817459016, 0.00210951586936603, 0.0038321943940431424, 0.004184439312639565, 0.004390955732485269, 0.005340002625028983, 0.005370893657345555, 0.005414226322506321, 0.005773556319037532, 0.006340761679045848  …  0.09478223194532731, 0.09539819718462916, 0.10084485833120627, 0.10410282088643674, 0.10552558060984962, 0.10597547132172291, 0.10719933634941348, 0.10798782810257634, 0.11299187022483492, 0.11619171220892821], [614, 148, 467, 346, 898, 339, 836, 551, 860, 167  …  21, 52, 844, 245, 768, 290, 563, 763, 11, 143], :single)``

The resulting dendrogram could be converted into disjoint clusters with the help of `cutree` function.

``cutree(hclu::Hclust; [k], [h]) -> Vector{Int}``

Cut the `hclu` dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector \$z\$ (\$z_i\$ is the index of the cluster for the \$i\$-th data point).

Arguments

• `k::Integer` (optional) the number of desired clusters.
• `h::Real` (optional) the height at which the tree is cut.

If both `k` and `h` are specified, it's guaranteed that the number of clusters is not less than `k` and their height is not above `h`.

See also: `hclust`

source