DBSCAN

# DBSCAN

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.

## Density Reachability

DBSCAN's definition of a cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.

A cluster, which is a subset of the given set of points, satisfies two properties:

1. All points within the cluster are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a cluster, there exists a point $o$ sucht that both $p$ and $q$ are density reachable from $o$.
2. If a point is density-connected to any point of a cluster, it is also part of that cluster.

## Interface

There are two implementations of DBSCAN algorithm in this package (both provided by dbscan function):

• Distance (adjacency) matrix-based. It requires $O(N^2)$ memory to run. Boundary points cannot be shared between the clusters.
• Adjacency list-based. The input is the $d \times n$ matrix of point coordinates. The adjacency list is built on the fly. The performance is much better both in terms of running time and memory usage. Returns a vector of DbscanCluster objects that contain the indices of the core and boundary points, making it possible to share the boundary points between multiple clusters.
dbscan(D::AbstractMatrix, eps::Real, minpts::Int) -> DbscanResult

Perform DBSCAN algorithm using the distance matrix D.

Arguments

The following options control which points would be considered density reachable:

• eps::Real: the radius of a point neighborhood
• minpts::Int: the minimum number of neighboring points (including itself) to qualify a point as a density point.
source
dbscan(points::AbstractMatrix, radius::Real,
[leafsize], [min_neighbors], [min_cluster_size]) -> Vector{DbscanCluster}

Cluster points using the DBSCAN (density-based spatial clustering of applications with noise) algorithm.

Arguments

• points: the $d×n$ matrix of points. points[:, j] is a $d$-dimensional coordinates of $j$-th point
• radius::Real: query radius

Optional keyword arguments to control the algorithm:

• leafsize::Int (defaults to 20): the number of points binned in each leaf node in the KDTree
• min_neighbors::Int (defaults to 1): the minimum number of a core point neighbors
• min_cluster_size::Int (defaults to 1): the minimum number of points in a valid cluster

Example

points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clusters = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
source
DbscanResult <: ClusteringResult

The output of dbscan function (distance matrix-based implementation).

Fields

• seeds::Vector{Int}: indices of cluster starting points
• assignments::Vector{Int}: vector of clusters indices, where each point was assigned to
• counts::Vector{Int}: cluster sizes (number of assigned points)
source
DbscanCluster

DBSCAN cluster returned by dbscan function (point coordinates-based implementation)

Fields

• size::Int: number of points in a cluster (core + boundary)
• core_indices::Vector{Int}: indices of points in the cluster core
• boundary_indices::Vector{Int}: indices of points on the cluster boundary
source