By Abhijit Annaldas, Microsoft.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses on the proximity and density of observations to form clusters. This is very different from KMeans, where an observation becomes part of the cluster represented by the nearest centroid. DBSCAN can also identify outliers: observations that don't belong to any cluster. Since DBSCAN discovers the number of clusters on its own, it is very useful for unsupervised learning when we don't know how many clusters the data might contain.
K-Means clustering may group loosely related observations together: every observation eventually becomes part of some cluster, even if it lies far away from the others in the vector space. Since clusters depend on the mean value of their elements, each data point plays a role in forming them, and a slight change in the data can affect the clustering outcome. This problem is greatly reduced in DBSCAN due to the way its clusters are formed.
In DBSCAN, clustering happens based on two important parameters:
- neighbourhood (n) - the cutoff distance from a core point (discussed below) within which a point is considered part of that cluster. Commonly referred to as epsilon (abbreviated as eps).
- minimum points (m) - the minimum number of points required within distance n for a region to be dense enough to form a cluster. Commonly referred to as minPts.
After DBSCAN clustering is complete, every point falls into one of three types:
- Core - a point that has at least m points within distance n of itself.
- Border - a point that has fewer than m points in its own neighbourhood, but at least one Core point within distance n.
- Noise - a point that is neither a Core nor a Border: it has fewer than m points within distance n and no Core point nearby.
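These three definitions can be computed directly from the data. The sketch below is a brute-force illustration rather than an efficient implementation; the function name `classify_points` and the convention that a point's neighbourhood includes the point itself are my own choices.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'.

    Brute-force O(n^2) sketch for illustration only. A point's
    neighbourhood is taken to include the point itself, which is
    a common (but not universal) counting convention.
    """
    # For each point, find all points within distance eps of it.
    neighbourhoods = [
        [q for q in points if dist(p, q) <= eps] for p in points
    ]
    is_core = [len(nb) >= min_pts for nb in neighbourhoods]

    labels = []
    for i, p in enumerate(points):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j, q in enumerate(points) if dist(p, q) <= eps):
            labels.append("border")   # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels
```

For example, with `eps=1.5` and `min_pts=3`, three mutually close points form cores, a point hanging off one of them becomes a border, and a distant point is noise.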
DBSCAN clustering can be summarized in the following steps...
- For each point P in the dataset, count the number of points pts within distance n of P.
- if pts >= m, label P a Core point
- if pts < m but a Core point lies within distance n, label P a Border point
- otherwise (pts < m and no Core point within distance n), label P a Noise point
- For the sake of explanation, let's refer to a Core point together with all the points within distance n of it as a Core-Set. All overlapping Core-Sets are merged into one cluster, much like individual graphs being joined into a set of connected graphs.
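The core-set merging in the steps above can be sketched as a breadth-first expansion from each unvisited Core point: a core point pulls in its whole neighbourhood, and any core points inside it keep expanding the cluster. This is a toy illustration, not production code; the function name and the `-1` label for noise (borrowed from sklearn's output convention) are my own choices.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Toy DBSCAN sketch: grow clusters by merging overlapping core-sets."""
    n = len(points)
    # Index-based neighbourhoods (each point's neighbourhood includes itself).
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]

    labels = [None] * n   # None = not yet assigned
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Breadth-first expansion of the cluster seeded at core point i.
        labels[i] = cluster
        queue = [i]
        while queue:
            p = queue.pop()
            for q in neighbours[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:       # only core points keep expanding
                        queue.append(q)
        cluster += 1
    # Points never reached by any core-set are noise (-1).
    return [lab if lab is not None else -1 for lab in labels]
```

Border points get absorbed when a core-set reaches them but never expand the cluster themselves, which is exactly why two clusters touching only through border points stay separate.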
Since clustering depends entirely on the parameters n and m above, choosing their values correctly is very important. Good domain knowledge helps in picking suitable values, but there are also heuristics that approximate these parameters fairly well without deep expertise in the domain.
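One widely used heuristic of this kind is the k-distance plot: compute each point's distance to its m-th nearest neighbour, sort those distances, and look for the "elbow" where the curve bends sharply upward; that distance is a reasonable candidate for eps. A rough sketch, with the helper name `k_distances` my own:

```python
from math import dist

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour
    (excluding the point itself), sorted ascending.

    Plotting this curve and picking the 'elbow' is a common
    heuristic for choosing eps; brute force, for illustration.
    """
    out = []
    for p in points:
        ds = sorted(dist(p, q) for q in points if q is not p)
        out.append(ds[k - 1])
    return sorted(out)
```

Points inside dense regions contribute small values at the left of the curve; outliers produce the sharp jump at the right, and eps is typically chosen just before that jump.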
See the DBSCAN demo in the sklearn examples, and try it yourself with sklearn.cluster.DBSCAN.
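Assuming scikit-learn is installed, a minimal run of sklearn.cluster.DBSCAN on a toy dataset might look like this; the data and parameter values here are illustrative. Note that sklearn's `min_samples` counts the point itself, and `-1` in the output marks noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each; eps plays the role of n
# and min_samples the role of m from the article.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])

model = DBSCAN(eps=1.5, min_samples=3)
labels = model.fit_predict(X)   # cluster id per point, -1 for noise
```

Here each group is dense enough to form its own cluster, so `labels` assigns one cluster id per group.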
Bio: Abhijit Annaldas is a Software Engineer and a voracious learner who has acquired Machine Learning knowledge and expertise to a fair extent. He improves his expertise day by day through learning new things and relentless practice, and has extensive experience building enterprise-scale applications in different Microsoft and Open Source technologies as a Software Engineer at Microsoft, India since June 2012.
Original. Reposted with permission.
- Comparing Distance Measurements with Python and SciPy
- Must-Know: How to determine the most useful number of clusters?
- Text Clustering: Get quick insights from Unstructured Data