Hierarchical Clustering

# Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The `hclust` function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the `linkage` parameter):

``hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust``

Perform hierarchical clustering using the distance matrix `d` and the cluster `linkage` function.

Returns the dendrogram as a `Hclust` object.

Arguments

• `d::AbstractMatrix`: the pairwise distance matrix. \$d_{ij}\$ is the distance between \$i\$-th and \$j\$-th points.
• `linkage::Symbol`: cluster linkage function to use. `linkage` defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
• `:single` (the default): use the minimum distance between any of the cluster members
• `:average`: use the mean distance between any of the cluster members
• `:complete`: use the maximum distance between any of the members
• `:ward`: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
• `:ward_presquared`: same as `:ward`, but assumes that the distances in `d` are already squared.
• `uplo::Symbol` (optional): specifies whether the upper (`:U`) or the lower (`:L`) triangle of `d` should be used to get the distances. If not specified, the method expects `d` to be symmetric.
• `branchorder::Symbol` (optional): algorithm to order leaves and branches. The valid choices are:
• `:r` (the default): ordering based on the node heights and the original elements order (compatible with R's `hclust`)
• `:barjoseph` (or `:optimal`): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et. al. Bioinformatics (2001)
source
``Hclust{T<:Real}``

The output of `hclust`, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's `hclust` class.

Fields

• `merges::Matrix{Int}`: \$N×2\$ matrix encoding subtree merges:
• each row specifies the left and right subtrees (referenced by their \$id\$s) that are merged
• negative subtree \$id\$ denotes the leaf node and corresponds to the data point at position \$-id\$
• positive \$id\$ denotes nontrivial subtree (the row `merges[id, :]` specifies its left and right subtrees)
• `linkage::Symbol`: the name of cluster linkage function used to construct the hierarchy (see `hclust`)
• `heights::Vector{T}`: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified `linkage`
• `order::Vector{Int}`: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: `hclust`.

source
``````using Clustering
D = rand(1000, 1000);
D += D'; # symmetric distance matrix (optional)
``Hclust{Float64}([-132 -266; -261 -877; … ; -396 997; -545 998], [0.00201865793089695, 0.00302232052798157, 0.003147694942261303, 0.004192706798883616, 0.0044699172799025355, 0.004557601862652083, 0.004947635085597479, 0.005934324675376912, 0.006897745727283633, 0.006971924432854326  …  0.09582142144320538, 0.09622907440242923, 0.09623031343815613, 0.0997659461960756, 0.10075265316326432, 0.10448992978704208, 0.10591775650764768, 0.10960801438926349, 0.11664284730638053, 0.11723239364496307], [545, 396, 829, 165, 815, 518, 315, 334, 528, 433  …  437, 818, 503, 666, 971, 260, 835, 394, 343, 400], :single)``

The resulting dendrogram could be converted into disjoint clusters with the help of `cutree` function.

``cutree(hclu::Hclust; [k], [h]) -> Vector{Int}``

Cut the `hclu` dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector \$z\$ (\$z_i\$ is the index of the cluster for the \$i\$-th data point).

Arguments

• `k::Integer` (optional) the number of desired clusters.
• `h::Real` (optional) the height at which the tree is cut.

If both `k` and `h` are specified, it's guaranteed that the number of clusters is not less than `k` and their height is not above `h`.

See also: `hclust`

source