DBSCAN

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.

Density Reachability

DBSCAN's definition of a cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.

The points within DBSCAN clusters are categorized into core (or seeds) and boundary:

All points of the cluster core are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a core, there exists a point $o$ such that both $p$ and $q$ are density reachable from $o$.
If a point is density-connected to any point of a cluster core, it is also part of the core.
All points within the $\epsilon$-neighborhood of any core point, but not belonging to that core (i.e. not density reachable from the core), are considered cluster boundary.

Interface

The implementation of DBSCAN algorithm provided by dbscan function supports the two ways of specifying clustering data:

The $d \times n$ matrix of point coordinates. This is the preferred method as it uses memory- and time-efficient neighboring points queries via NearestNeighbors.jl package.
The $n\times n$ matrix of precalculated pairwise point distances. It requires $O(n^2)$ memory and time to run.

Clustering.dbscan — Function

dbscan(points::AbstractMatrix, radius::Real;
       [metric=Euclidean()],
       [min_neighbors=1], [min_cluster_size=1],
       [nntree_kwargs...]) -> DbscanResult

Cluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

Arguments

points: when metric is specified, the d×n matrix, where each column is a d-dimensional coordinate of a point; when metric=nothing, the n×n matrix of pairwise distances between the points
radius::Real: neighborhood radius; points within this distance are considered neighbors

Optional keyword arguments to control the algorithm:

metric (defaults to Euclidean()): the points distance metric to use, nothing means points is the n×n precalculated distance matrix
min_neighbors::Integer (defaults to 1): the minimal number of neighbors required to assign a point to a cluster "core"
min_cluster_size::Integer (defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded
nntree_kwargs...: parameters (like leafsize) for the KDTree constructor

Example

points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", KDD-1996, pp. 226–231.
Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, Vol.42(3)3, pp. 1–21, https://doi.org/10.1145/3068335

source

Clustering.DbscanResult — Type

DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

clusters::Vector{DbscanCluster}: clusters, length K
seeds::Vector{Int}: indices of the first points of each cluster's core, length K
counts::Vector{Int}: cluster sizes (number of assigned points), length K
assignments::Vector{Int}: vector of clusters indices, where each point was assigned to, length N

source

Clustering.DbscanCluster — Type

DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

size::Int: number of points in a cluster (core + boundary)
core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
boundary_indices::Vector{Int}: indices of the cluster points outside of core

source