# Evaluation & Validation

The *Clustering.jl* package provides a number of methods to evaluate the results of a clustering algorithm and/or to validate its correctness.

## Cross tabulation

Cross tabulation, or *contingency matrix*, is a basis for many clustering quality measures. It shows how similar the two clusterings are at the cluster level.

*Clustering.jl* extends `StatsBase.counts()` with methods that accept `ClusteringResult` arguments:

`StatsBase.counts` — Method

```
counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}
counts(a::ClusteringResult, b::AbstractVector{<:Integer}) -> Matrix{Int}
counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}
```

Calculate the *cross tabulation* (aka *contingency matrix*) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix `C`, where $n_a$ and $n_b$ are the numbers of clusters in `a` and `b`, respectively, and `C[i, j]` is the size of the intersection of the `i`-th cluster from `a` and the `j`-th cluster from `b`.

The clusterings could be specified either as `ClusteringResult` instances or as vectors of data point assignments.
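
For example, a k-means clustering could be cross-tabulated against a vector of reference labels (a minimal sketch; the data, the reference labels, and the number of clusters below are made up for illustration):

```julia
using Clustering

X = rand(2, 100)         # hypothetical data: 100 two-dimensional points (one per column)
truth = rand(1:3, 100)   # hypothetical reference labels for the same points

km = kmeans(X, 3)        # ClusteringResult with 3 clusters

C = counts(km, truth)    # 3×3 contingency matrix:
                         # C[i, j] is the number of points in k-means cluster i
                         # and reference class j
```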

## Rand index

Rand index is a measure of the similarity between two data clusterings. From a mathematical standpoint, the Rand index is related to prediction accuracy, but it is applicable even when the original class labels are not used.

`Clustering.randindex` — Function

`randindex(a, b) -> NTuple{4, Float64}`

Compute the tuple of Rand-related indices between the clusterings `a` and `b`.

`a` and `b` can be either `ClusteringResult` instances or assignments vectors (`AbstractVector{<:Integer}`).

Returns a tuple of indices:

- Hubert & Arabie Adjusted Rand index
- Rand index (agreement probability)
- Mirkin's index (disagreement probability)
- Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)
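
A minimal usage sketch with made-up assignment vectors for the same eight points:

```julia
using Clustering

a = [1, 1, 1, 2, 2, 3, 3, 3]   # hypothetical clustering of 8 points
b = [1, 1, 2, 2, 2, 3, 3, 3]   # another hypothetical clustering of the same points

ari, ri, mirkin, hubert = randindex(a, b)
# ari:    Hubert & Arabie adjusted Rand index
# ri:     Rand index (agreement probability)
# mirkin: Mirkin's index (disagreement probability)
# hubert: Hubert's index, P(agree) - P(disagree)
```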

**References**

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. *Journal of Classification* 2(1): 193–218.

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. *Learning Theory and Kernel Machines*: 173–187.

## Silhouettes

*Silhouettes* is a method for evaluating the quality of clustering. In particular, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.

The *Silhouette* value for the $i$-th data point is:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \text{ where}$$

- $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
- $b_i ≝ \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using `mean(silhouettes(assignments, counts, X))` as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.

`Clustering.silhouettes` — Function

```
silhouettes(assignments::AbstractVector, [counts,] dists) -> Vector{Float64}
silhouettes(clustering::ClusteringResult, dists) -> Vector{Float64}
```

Compute *silhouette* values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

**Arguments**

- `assignments::AbstractVector{Int}`: the vector of point assignments (cluster indices)
- `counts::AbstractVector{Int}`: the optional vector of cluster sizes (how many points are assigned to each cluster; should match `assignments`)
- `clustering::ClusteringResult`: the output of some clustering method
- `dists::AbstractMatrix`: $n×n$ matrix of pairwise distances between the points
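
For instance, silhouettes of a k-means clustering could be computed from a pairwise distance matrix built with Distances.jl (a minimal sketch; the data and the number of clusters are arbitrary assumptions):

```julia
using Clustering, Distances, Statistics

X = rand(2, 100)                           # hypothetical data: 100 points (one per column)
km = kmeans(X, 3)

dists = pairwise(Euclidean(), X, dims=2)   # 100×100 matrix of pairwise Euclidean distances

s = silhouettes(km, dists)                 # silhouette value for each point
quality = mean(s)                          # higher mean value indicates better-separated clusters
```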

**References**

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. *Computational and Applied Mathematics* 20: 53–65.

## Variation of Information

Variation of information (also known as *shared information distance*) is a measure of the distance between two clusterings. It is derived from the *mutual information*, but it is a true metric, *i.e.* it is symmetric and satisfies the triangle inequality.

`Clustering.varinfo` — Function

`varinfo(a, b) -> Float64`

Compute the *variation of information* between the two clusterings of the same data points.

`a` and `b` can be either `ClusteringResult` instances or assignments vectors (`AbstractVector{<:Integer}`).
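
A minimal usage sketch with made-up assignment vectors:

```julia
using Clustering

a = [1, 1, 2, 2, 3, 3]   # hypothetical clustering of 6 points
b = [1, 1, 1, 2, 2, 2]   # another hypothetical clustering of the same points

d  = varinfo(a, b)       # distance between the two clusterings
d0 = varinfo(a, a)       # ≈ 0: a clustering is at zero distance from itself
```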

**References**

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. *Learning Theory and Kernel Machines*: 173–187.

## V-measure

*V*-measure can be used to compare the clustering results with the existing class labels of data points or with an alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

$$V_{\beta} = (1 + \beta)\frac{h \cdot c}{\beta \cdot h + c}$$

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) aims to put all elements of each class in single clusters. The $\beta$ parameter ($\beta > 0$) can be used to control the weights of $h$ and $c$ in the final measure: if $\beta > 1$, *completeness* has more weight, and if $\beta < 1$, *homogeneity* does.

`Clustering.vmeasure` — Function

`vmeasure(a, b; [β = 1.0]) -> Float64`

V-measure between the two clusterings.

`a` and `b` can be either `ClusteringResult` instances or assignments vectors (`AbstractVector{<:Integer}`).

The `β` parameter defines the trade-off between *homogeneity* and *completeness*:

- if $β > 1$, *completeness* is weighted more strongly,
- if $β < 1$, *homogeneity* is weighted more strongly.
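
For illustration, a hypothetical comparison of a clustering against reference class labels, computed with two values of `β` (both vectors are made up):

```julia
using Clustering

labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]   # hypothetical reference class labels
clust  = [1, 1, 2, 2, 2, 3, 3, 3, 3]   # hypothetical clustering of the same points

v1 = vmeasure(labels, clust)            # β = 1: homogeneity and completeness weighted equally
v2 = vmeasure(labels, clust, β = 2.0)   # β > 1: completeness weighted more strongly
```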

**References**

Andrew Rosenberg and Julia Hirschberg (2007). V-Measure: A conditional entropy-based external cluster evaluation measure.

## Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

`Clustering.mutualinfo` — Function

`mutualinfo(a, b; normed=true) -> Float64`

Compute the *mutual information* between the two clusterings of the same data points.

`a` and `b` can be either `ClusteringResult` instances or assignments vectors (`AbstractVector{<:Integer}`).

If the `normed` parameter is `true`, the return value is the normalized mutual information (symmetric uncertainty), see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank (2005).
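
A minimal usage sketch with made-up assignment vectors:

```julia
using Clustering

a = [1, 1, 2, 2, 3, 3]   # hypothetical clustering of 6 points
b = [1, 1, 2, 2, 2, 3]   # another hypothetical clustering of the same points

nmi = mutualinfo(a, b)                  # normalized mutual information (default)
mi  = mutualinfo(a, b, normed=false)    # unnormalized mutual information
```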

**References**

Vinh, Epps, and Bailey (2009). Information theoretic measures for clusterings comparison. *Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09*.