# Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The `hclust`

function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the `linkage`

parameter):

`Clustering.hclust`

— Function.`hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust`

Perform hierarchical clustering using the distance matrix `d`

and the cluster `linkage`

function.

Returns the dendrogram as a `Hclust`

object.

**Arguments**

`d::AbstractMatrix`

: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.`linkage::Symbol`

:*cluster linkage*function to use.`linkage`

defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:`:single`

(the default): use the minimum distance between any of the cluster members`:average`

: use the mean distance between any of the cluster members`:complete`

: use the maximum distance between any of the members`:ward`

: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters`:ward_presquared`

: same as`:ward`

, but assumes that the distances in`d`

are already squared.

`uplo::Symbol`

(optional): specifies whether the upper (`:U`

) or the lower (`:L`

) triangle of`d`

should be used to get the distances. If not specified, the method expects`d`

to be symmetric.`branchorder::Symbol`

(optional): algorithm to order leaves and branches. The valid choices are:`:r`

(the default): ordering based on the node heights and the original elements order (compatible with R's`hclust`

)`:barjoseph`

(or`:optimal`

): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et. al.*Bioinformatics*(2001)

`Clustering.Hclust`

— Type.`Hclust{T<:Real}`

The output of `hclust`

, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's `hclust`

class.

**Fields**

`merges::Matrix{Int}`

: $N×2$ matrix encoding subtree merges:- each row specifies the left and right subtrees (referenced by their $id$s) that are merged
- negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
- positive $id$ denotes nontrivial subtree (the row
`merges[id, :]`

specifies its left and right subtrees)

`linkage::Symbol`

: the name of*cluster linkage*function used to construct the hierarchy (see`hclust`

)`heights::Vector{T}`

: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified`linkage`

`order::Vector{Int}`

: the data point indices ordered so that there are no intersecting branches on the*dendrogram*plot. This ordering also puts the points of the same cluster close together.

See also: `hclust`

.

```
using Clustering
D = rand(1000, 1000);
D += D'; # symmetric distance matrix (optional)
result = hclust(D, linkage=:single)
```

`Hclust{Float64}([-132 -266; -261 -877; … ; -396 997; -545 998], [0.00201865793089695, 0.00302232052798157, 0.003147694942261303, 0.004192706798883616, 0.0044699172799025355, 0.004557601862652083, 0.004947635085597479, 0.005934324675376912, 0.006897745727283633, 0.006971924432854326 … 0.09582142144320538, 0.09622907440242923, 0.09623031343815613, 0.0997659461960756, 0.10075265316326432, 0.10448992978704208, 0.10591775650764768, 0.10960801438926349, 0.11664284730638053, 0.11723239364496307], [545, 396, 829, 165, 815, 518, 315, 334, 528, 433 … 437, 818, 503, 666, 971, 260, 835, 394, 343, 400], :single)`

The resulting dendrogram could be converted into disjoint clusters with the help of `cutree`

function.

`Clustering.cutree`

— Function.`cutree(hclu::Hclust; [k], [h]) -> Vector{Int}`

Cut the `hclu`

dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

**Arguments**

`k::Integer`

(optional) the number of desired clusters.`h::Real`

(optional) the height at which the tree is cut.

If both `k`

and `h`

are specified, it's guaranteed that the number of clusters is not less than `k`

and their height is not above `h`

.

See also: `hclust`