K-means
K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to the cluster with the nearest center.
From a mathematical standpoint, K-means is a coordinate descent algorithm that searches for a (generally local) solution of the following optimization problem:
\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]
Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is the index of the cluster to which the $i$-th point $\mathbf{x}_i$ is assigned.
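The coordinate descent view can be sketched in plain Julia: alternately minimize the objective over the assignments $z$ with the centers fixed, then over the centers $\boldsymbol{\mu}$ with the assignments fixed. This is an illustrative sketch only, not the implementation used by Clustering.jl; the names kmeans_cost and lloyd_step! and the naive seeding are assumptions made for the example.

```julia
using Random

# Objective from the formula above: sum of squared distances from each point
# to the center of its assigned cluster.
kmeans_cost(X, centers, z) =
    sum(sum(abs2, X[:, i] .- centers[:, z[i]]) for i in axes(X, 2))

# One round of coordinate descent (hypothetical helper, not part of Clustering.jl).
function lloyd_step!(centers, z, X)
    k = size(centers, 2)
    # Step 1: with the centers fixed, assign each point to its nearest center.
    for i in axes(X, 2)
        z[i] = argmin([sum(abs2, X[:, i] .- centers[:, j]) for j in 1:k])
    end
    # Step 2: with the assignments fixed, move each center to the mean of its members.
    for j in 1:k
        members = findall(==(j), z)
        if !isempty(members)  # leave the center of an empty cluster untouched
            centers[:, j] = vec(sum(X[:, members], dims=2)) ./ length(members)
        end
    end
    return centers, z
end

rng = MersenneTwister(1)
X = rand(rng, 2, 200)          # 200 points in 2 dimensions
centers = X[:, 1:3]            # naive seeding (assumption): copy of the first three points
z = ones(Int, size(X, 2))
costs = Float64[]
for _ in 1:10
    lloyd_step!(centers, z, X)
    push!(costs, kmeans_cost(X, centers, z))
end
```

Neither step can increase the objective, so the recorded costs form a non-increasing sequence; the result still depends on the initial seeds.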
Clustering.kmeans — Function
kmeans(X, k, [...]) -> KmeansResult
K-means clustering of the $d \times n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.
Arguments
- init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
  - a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
  - an instance of SeedingAlgorithm;
  - an integer vector of length $k$ that provides the indices of points to use as initial seeds.
- weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
- maxiter, tol, display: see common options
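A short sketch of how the init and weights arguments described above might be passed. The data and the weight values are made up for illustration.

```julia
using Clustering

X = rand(4, 300)                 # 300 points in 4 dimensions

# init as a Symbol naming a seeding algorithm (:kmpp is the documented default)
R1 = kmeans(X, 3; init=:kmpp)

# init as an integer vector: use points 1, 2 and 3 as the initial seeds
R2 = kmeans(X, 3; init=[1, 2, 3])

# weights: one weight per point; centers are weighted means of cluster members
w = rand(300) .+ 0.5             # arbitrary positive weights (illustrative)
R3 = kmeans(X, 3; weights=w)
```

Seeding with explicit indices makes a run reproducible for a given X, whereas the randomized seeding algorithms can give different clusterings from run to run.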
Clustering.KmeansResult — Type
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult
The output of kmeans and kmeans!.
Type parameters
- C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
- D<:Real: type of the assignment cost
- WC<:Real: type of the cluster weight
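As a rough illustration of how the three type parameters relate to the stored data, the sketch below inspects a result. The field names costs and totalcost and the accessor wcounts are taken to be part of the KmeansResult interface; treat them as assumptions if your installed version differs.

```julia
using Clustering

X = rand(3, 100)
R = kmeans(X, 4)

C  = typeof(R.centers)      # C: type of the centers matrix
D  = eltype(R.costs)        # D: element type of the per-point assignment costs (assumed field)
WC = eltype(wcounts(R))     # WC: element type of the cluster weights (assumed accessor)

total = R.totalcost         # objective value at the returned solution (assumed field)
sizes = counts(R)           # number of points in each cluster
```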
If you already have a set of initial center vectors, kmeans! can be used instead:
Clustering.kmeans! — Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult
Update the current cluster centers ($d \times k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d \times n$ data matrix X (each column of X is a $d$-dimensional data point).
See kmeans for the description of optional kwargs.
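A minimal sketch of calling kmeans! with user-supplied centers. Taking the first four points as starting centers is an arbitrary choice made for this example.

```julia
using Clustering

X = rand(2, 500)                 # 500 points in 2 dimensions
centers = X[:, 1:4]              # 2×4 matrix of initial centers (a copy, so X is not modified)

# refine the supplied centers; the centers matrix is updated by the call
R = kmeans!(X, centers; maxiter=50)
```

The number of clusters is implied by the number of columns of centers, so no k argument is passed.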
Examples
using Clustering
# make a dataset of 1000 random 5-dimensional points
X = rand(5, 1000)
# cluster X into 20 clusters using K-means
R = kmeans(X, 20; maxiter=200, display=:iter)
@assert nclusters(R) == 20 # verify the number of clusters
a = assignments(R) # get the assignments of points to clusters
c = counts(R) # get the cluster sizes
M = R.centers # get the cluster centers
5×20 Matrix{Float64}:
0.613434 0.827851 0.415703 0.761409 … 0.24174 0.416385 0.312106
0.215067 0.702396 0.773295 0.377902 0.478804 0.780879 0.187333
0.222829 0.717061 0.280821 0.232335 0.695157 0.797438 0.205332
0.267597 0.74069 0.770797 0.78535 0.196763 0.772653 0.680894
0.240089 0.670621 0.751962 0.390059 0.171474 0.21502 0.732506
Scatter plot of the K-means clustering results:
using RDatasets, Clustering, Plots
iris = dataset("datasets", "iris"); # load the data
features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
result = kmeans(features, 3); # run K-means for the 3 clusters
# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
        color=:lightrainbow, legend=false)