DBSCAN
Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:
Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.
Density Reachability
DBSCAN's definition of a cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.
The points within DBSCAN clusters are categorized into core (or seeds) and boundary:
- All points of the cluster core are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a core, there exists a point $o$ such that both $p$ and $q$ are density reachable from $o$.
- If a point is density-connected to any point of a cluster core, it is also part of the core.
- All points within the $\epsilon$-neighborhood of any core point, but not belonging to that core (i.e. not density reachable from the core), are considered cluster boundary.
Interface
The implementation of DBSCAN algorithm provided by dbscan
function supports the two ways of specifying clustering data:
- The $d \times n$ matrix of point coordinates. This is the preferred method as it uses memory- and time-efficient neighboring points queries via NearestNeighbors.jl package.
- The $n\times n$ matrix of precalculated pairwise point distances. It requires $O(n^2)$ memory and time to run.
Clustering.dbscan
— Functiondbscan(points::AbstractMatrix, radius::Real;
[metric=Euclidean()],
[min_neighbors=1], [min_cluster_size=1],
[nntree_kwargs...]) -> DbscanResult
Cluster points
using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.
Arguments
points
: whenmetric
is specified, the d×n matrix, where each column is a d-dimensional coordinate of a point; whenmetric=nothing
, the n×n matrix of pairwise distances between the pointsradius::Real
: neighborhood radius; points within this distance are considered neighbors
Optional keyword arguments to control the algorithm:
metric
(defaults toEuclidean()
): the points distance metric to use,nothing
meanspoints
is the n×n precalculated distance matrixmin_neighbors::Integer
(defaults to 1): the minimal number of neighbors required to assign a point to a cluster "core"min_cluster_size::Integer
(defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discardednntree_kwargs...
: parameters (likeleafsize
) for theKDTree
constructor
Example
points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
References:
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", KDD-1996, pp. 226–231.
- Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, Vol.42(3)3, pp. 1–21, https://doi.org/10.1145/3068335
Clustering.DbscanResult
— TypeDbscanResult <: ClusteringResult
The output of dbscan
function.
Fields
clusters::Vector{DbscanCluster}
: clusters, length Kseeds::Vector{Int}
: indices of the first points of each cluster's core, length Kcounts::Vector{Int}
: cluster sizes (number of assigned points), length Kassignments::Vector{Int}
: vector of clusters indices, where each point was assigned to, length N
Clustering.DbscanCluster
— TypeDbscanCluster
DBSCAN cluster, part of DbscanResult
returned by dbscan
function.
Fields
size::Int
: number of points in a cluster (core + boundary)core_indices::Vector{Int}
: indices of points in the cluster core, a.k.a. seeds (have at leastmin_neighbors
neighbors in the cluster)boundary_indices::Vector{Int}
: indices of the cluster points outside of core