DBSCAN
Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:
Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.
Density Reachability
DBSCAN's definition of a cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.
A cluster, which is a subset of the given set of points, satisfies two properties:
- All points within the cluster are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a cluster, there exists a point $o$ sucht that both $p$ and $q$ are density reachable from $o$.
- If a point is density-connected to any point of a cluster, it is also part of that cluster.
Interface
There are two implementations of DBSCAN algorithm in this package (both provided by dbscan
function):
- Distance (adjacency) matrix-based. It requires $O(N^2)$ memory to run. Boundary points cannot be shared between the clusters.
- Adjacency list-based. The input is the $d \times n$ matrix of point coordinates. The adjacency list is built on the fly. The performance is much better both in terms of running time and memory usage. Returns a vector of
DbscanCluster
objects that contain the indices of the core and boundary points, making it possible to share the boundary points between multiple clusters.
Clustering.dbscan
— Function.dbscan(D::AbstractMatrix, eps::Real, minpts::Int) -> DbscanResult
Perform DBSCAN algorithm using the distance matrix D
.
Arguments
The following options control which points would be considered density reachable:
eps::Real
: the radius of a point neighborhoodminpts::Int
: the minimum number of neighboring points (including itself) to qualify a point as a density point.
dbscan(points::AbstractMatrix, radius::Real,
[leafsize], [min_neighbors], [min_cluster_size]) -> Vector{DbscanCluster}
Cluster points
using the DBSCAN (density-based spatial clustering of applications with noise) algorithm.
Arguments
points
: the $d×n$ matrix of points.points[:, j]
is a $d$-dimensional coordinates of $j$-th pointradius::Real
: query radius
Optional keyword arguments to control the algorithm:
leafsize::Int
(defaults to 20): the number of points binned in each leaf node in theKDTree
min_neighbors::Int
(defaults to 1): the minimum number of a core point neighborsmin_cluster_size::Int
(defaults to 1): the minimum number of points in a valid cluster
Example
points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clusters = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
Clustering.DbscanResult
— Type.DbscanResult <: ClusteringResult
The output of dbscan
function (distance matrix-based implementation).
Fields
seeds::Vector{Int}
: indices of cluster starting pointsassignments::Vector{Int}
: vector of clusters indices, where each point was assigned tocounts::Vector{Int}
: cluster sizes (number of assigned points)
Clustering.DbscanCluster
— Type.DbscanCluster
DBSCAN cluster returned by dbscan
function (point coordinates-based implementation)
Fields
size::Int
: number of points in a cluster (core + boundary)core_indices::Vector{Int}
: indices of points in the cluster coreboundary_indices::Vector{Int}
: indices of points on the cluster boundary