K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to the cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm for the following optimization problem (in general it converges to a local, not necessarily global, minimum):

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is the index of the cluster to which the $i$-th point $\mathbf{x}_i$ is assigned.
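
To make the coordinate descent view concrete, the two alternating steps can be written out as below. This is the textbook (Lloyd-style) form for unweighted points; the index set $C_k$ is notation introduced here for illustration, and the library's implementation may differ in details such as point weights and the handling of empty clusters. With the centers held fixed, each point is reassigned to its nearest center; with the assignments held fixed, each center becomes the mean of its members:

\[z_i \leftarrow \operatorname*{arg\,min}_{k} \| \mathbf{x}_i - \boldsymbol{\mu}_k \|^2, \qquad \boldsymbol{\mu}_k \leftarrow \frac{1}{|C_k|} \sum_{i \in C_k} \mathbf{x}_i, \quad C_k = \{ i : z_i = k \}\]

Neither step can increase the objective, which is why the iteration converges, though in general only to a local minimum.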

Clustering.kmeans (Function)
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized; can be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
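
The options above might be combined as in the following minimal sketch. The data, the seed indices, and the weight values are made up for illustration; :rand and KmppAlg are used here on the assumption that they are among the seeding methods listed in the Seeding section.

```julia
using Clustering, Random

Random.seed!(42)            # fix the RNG so the sketch is reproducible
X = rand(2, 60)             # 60 two-dimensional points, one per column

# 1. Seeding algorithm given by name (a Symbol)
r1 = kmeans(X, 3; init=:rand)

# 2. Seeding algorithm given as a SeedingAlgorithm instance
r2 = kmeans(X, 3; init=KmppAlg())

# 3. Explicit indices of the points to use as initial seeds (length k)
r3 = kmeans(X, 3; init=[1, 20, 40])

# Point weights (illustrative values): the first 30 points count twice as much
w = vcat(fill(2.0, 30), fill(1.0, 30))
r4 = kmeans(X, 3; weights=w, maxiter=50, tol=1e-6)
```

Whichever form of init is used, only the starting centers change; the iteration that follows is the same.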
Clustering.KmeansResult (Type)
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
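
As a rough illustration of how these type parameters show up in practice, the sketch below runs kmeans on made-up data and inspects the result. The field names totalcost and converged are taken from the package as I understand it and are not described in this section, so treat them as assumptions.

```julia
using Clustering

X = rand(3, 50)             # 50 three-dimensional points, one per column
R = kmeans(X, 4)

typeof(R)                   # concrete KmeansResult{C,D,WC} chosen for this input
R isa ClusteringResult      # KmeansResult is a subtype of ClusteringResult
size(R.centers)             # (3, 4): one column per cluster center

# Assumed field names, not documented in this section:
R.totalcost                 # total assignment cost (the objective value)
R.converged                 # whether the run converged within maxiter
```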

If you already have a set of initial center vectors, kmeans! can be used:

Clustering.kmeans! (Function)
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.
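
A minimal sketch of calling kmeans! with user-supplied centers follows. The data and the choice of the first three points as starting centers are arbitrary, made up for illustration.

```julia
using Clustering, Random

Random.seed!(1)
X = rand(2, 100)                 # 100 two-dimensional points, one per column

# Hypothetical starting centers: copies of the first three points (d×k = 2×3)
centers = X[:, [1, 2, 3]]

R = kmeans!(X, centers; maxiter=100)

# `centers` has been updated in place and matches the centers in the result
size(centers)                    # still (2, 3)
R.centers == centers             # true
```

Because the centers matrix is modified in place, pass a copy if the starting values are needed afterwards.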


Examples

using Clustering

# make a dataset of 1000 random 5-dimensional points (one point per column)
X = rand(5, 1000)

# cluster X into 20 clusters using K-means
R = kmeans(X, 20; maxiter=200, display=:iter)

@assert nclusters(R) == 20 # verify the number of clusters

a = assignments(R) # get the assignments of points to clusters
c = counts(R) # get the cluster sizes
M = R.centers # get the cluster centers
5×20 Matrix{Float64}:
 0.813234  0.760555  0.20344   0.736628  …  0.264632  0.274247  0.705199
 0.775943  0.771485  0.332745  0.732916     0.333384  0.809243  0.399583
 0.675965  0.182292  0.757112  0.803613     0.203145  0.462186  0.281596
 0.800403  0.415761  0.8025    0.196722     0.211672  0.23644   0.804999
 0.263039  0.589203  0.756311  0.750162     0.346085  0.702983  0.26619

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
iris = dataset("datasets", "iris"); # load the data

features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
result = kmeans(features, 3); # run K-means with 3 clusters

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
        color=:lightrainbow, legend=false)