What is DBSCAN clustering algorithm?

  • There are different approaches and algorithms to perform clustering tasks which can be divided into three sub-categories:

    • Partition-based clustering: k-means, k-median
    • Hierarchical clustering: agglomerative, divisive
    • Density-based clustering: DBSCAN
  • DBSCAN stands for density-based spatial clustering of applications with noise. We suppose that you have been familiar with k-means clustering. Before plunging into the details of the DBSCAN algorithm, let's show you a figure which will give you some inspiration on why we need DBSCAN when we already have k-means clustering.

  • Classification according to DBSCAN clustering is more intuitive. What makes DBSCAN more attractive is that you don't have to specify the number of clusters in advance. Now let's see how it works.

  • The DBSCAN algorithm uses two parameters:

    • minPts: The minimum number of points clustered together for a region to be considered dense.
    • eps(\(\epsilon\)): A distance measure that is used to locate the points in the neighborhood of any point.
  • Points in the feature space are then classified as follows:

    • Core point: A point \(p\) is a core point if at least minPts points are within distance \(\epsilon\) of it. Those points are marked as directly reachable from \(p\).
    • Reachable point: A point \(q\) is reachable from \(p\), if there is a path of points \(p_{1}, \ldots, p_{n}\) with \(p_{1}=p\) and \(p_{n}=q\), where each \(p_{i+1}\) is directly reachable from \(p_{i}\)
    • Outliers: All non-reachable points from any other points are outliers.
  • DBSCAN clustering algorithm procedure in 3 steps:

    1. Arbitrarily pick one point in the dataset.
    2. If there are at least minPts points within a radius of \(\epsilon\) to that point, take all these points to the same cluster.
    3. The cluster is then expanded by recursively repeating the \(\epsilon\)-neighborhood calculation for each new-taken point. Continuing the expansion until there is no reachable point can be taken in this cluster.
    4. Pick a new point outside the existing clusters, repeat Step 2 and Step 3, until all points in the dataset have been considered.
  • Please note that, when we say "distance" it doesn't have to be Euclidean distance, it also could be any other metric that is appropriate for the classification problem to be solved.

  • Python code for DBSCAN clustering:

from sklearn.cluster import DBSCAN
dbCluster = DBSCAN(eps=0.3, min_samples=10)
dbCluster.fit(X_train)

Announcement: The content above is credited to many resources. All blogs on this website are study notes, not copyright publications.