# DBSCAN

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu

A density-based algorithm for discovering clusters in large spatial databases with noise.1996.

## Density Reachability

DBSCAN's definition of a cluster is based on the concept of *density reachability*: a point $q$ is said to be *directly density reachable* by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be *density reachable* by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.

A cluster, which is a subset of the given set of points, satisfies two properties:

- All points within the cluster are mutually
*density-connected*, meaning that for any two distinct points $p$ and $q$ in a cluster, there exists a point $o$ sucht that both $p$ and $q$ are density reachable from $o$. - If a point is density-connected to any point of a cluster, it is also part of that cluster.

## Interface

There are two implementations of *DBSCAN* algorithm in this package (both provided by `dbscan`

function):

- Distance (adjacency) matrix-based. It requires $O(N^2)$ memory to run. Boundary points cannot be shared between the clusters.
- Adjacency list-based. The input is the $d \times n$ matrix of point coordinates. The adjacency list is built on the fly. The performance is much better both in terms of running time and memory usage. Returns a vector of
`DbscanCluster`

objects that contain the indices of the*core*and*boundary*points, making it possible to share the boundary points between multiple clusters.

`Clustering.dbscan`

— Function.`dbscan(D::AbstractMatrix, eps::Real, minpts::Int) -> DbscanResult`

Perform DBSCAN algorithm using the distance matrix `D`

.

**Arguments**

The following options control which points would be considered *density reachable*:

`eps::Real`

: the radius of a point neighborhood`minpts::Int`

: the minimum number of neighboring points (including itself) to qualify a point as a density point.

```
dbscan(points::AbstractMatrix, radius::Real,
[leafsize], [min_neighbors], [min_cluster_size]) -> Vector{DbscanCluster}
```

Cluster `points`

using the DBSCAN (density-based spatial clustering of applications with noise) algorithm.

**Arguments**

`points`

: the $d×n$ matrix of points.`points[:, j]`

is a $d$-dimensional coordinates of $j$-th point`radius::Real`

: query radius

Optional keyword arguments to control the algorithm:

`leafsize::Int`

(defaults to 20): the number of points binned in each leaf node in the`KDTree`

`min_neighbors::Int`

(defaults to 1): the minimum number of a*core*point neighbors`min_cluster_size::Int`

(defaults to 1): the minimum number of points in a valid cluster

**Example**

```
points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clusters = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
```

`Clustering.DbscanResult`

— Type.`DbscanResult <: ClusteringResult`

The output of `dbscan`

function (distance matrix-based implementation).

**Fields**

`seeds::Vector{Int}`

: indices of cluster starting points`assignments::Vector{Int}`

: vector of clusters indices, where each point was assigned to`counts::Vector{Int}`

: cluster sizes (number of assigned points)

`Clustering.DbscanCluster`

— Type.`DbscanCluster`

DBSCAN cluster returned by `dbscan`

function (point coordinates-based implementation)

**Fields**

`size::Int`

: number of points in a cluster (core + boundary)`core_indices::Vector{Int}`

: indices of points in the cluster*core*`boundary_indices::Vector{Int}`

: indices of points on the cluster*boundary*