DBSCAN

DBSCAN

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.

Density Reachability

DBSCAN's definition of a cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.

A cluster, which is a subset of the given set of points, satisfies two properties:

  1. All points within the cluster are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a cluster, there exists a point $o$ sucht that both $p$ and $q$ are density reachable from $o$.
  2. If a point is density-connected to any point of a cluster, it is also part of that cluster.

Interface

There are two implementations of DBSCAN algorithm in this package (both provided by dbscan function):

Clustering.dbscanFunction.
dbscan(D::DenseMatrix, eps::Real, minpts::Int) -> DbscanResult

Perform DBSCAN algorithm using the distance matrix D.

Arguments

The following options control which points would be considered density reachable:

  • eps::Real: the radius of a point neighborhood
  • minpts::Int: the minimum number of neighboring points (including itself) to qualify a point as a density point.
source
dbscan(points::AbstractMatrix, radius::Real,
       [leafsize], [min_neighbors], [min_cluster_size]) -> Vector{DbscanCluster}

Cluster points using the DBSCAN (density-based spatial clustering of applications with noise) algorithm.

Arguments

  • points: the $d×n$ matrix of points. points[:, j] is a $d$-dimensional coordinates of $j$-th point
  • radius::Real: query radius

Optional keyword arguments to control the algorithm:

  • leafsize::Int (defaults to 20): the number of points binned in each leaf node in the KDTree
  • min_neighbors::Int (defaults to 1): the minimum number of a core point neighbors
  • min_cluster_size::Int (defaults to 1): the minimum number of points in a valid cluster

Example

points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clusters = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
source
DbscanResult <: ClusteringResult

The output of dbscan function (distance matrix-based implementation).

Fields

  • seeds::Vector{Int}: indices of cluster starting points
  • assignments::Vector{Int}: vector of clusters indices, where each point was assigned to
  • counts::Vector{Int}: cluster sizes (number of assigned points)
source
DbscanCluster

DBSCAN cluster returned by dbscan function (point coordinates-based implementation)

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core
  • boundary_indices::Vector{Int}: indices of points on the cluster boundary
source