Data scientists across many fields use clustering methods to find naturally "similar" groups of observations in their datasets. Popular clustering methods fall into a few categories:

Centroid-based: Assigns points to k clusters based on their proximity to learned centroids.

Graph-based: Groups vertices in the graph based on their connections.

Density-based: More flexible grouping based on the density or sparsity of data in nearby areas.

The Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm is a density-based clustering method that is robust to noise: it treats points in sparse regions as cluster boundaries and labels some of them outright as noise. Density-based clustering methods such as HDBSCAN can find clusters of odd shapes and sizes, as opposed to centroid-based clustering methods such as k-means, k-medoids, or Gaussian mixture models, which find a set of k centroids and model clusters as spheres of fixed shape and size. In addition to requiring that k be specified in advance, the performance and simplicity of centroid-based algorithms have helped them remain among the most popular approaches for clustering points in high dimensions, even though they cannot model clusters of varying shape, size, or density without modifying the input data points.
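The contrast above can be seen in a few lines of code. This is a minimal CPU sketch (not from the post) using scikit-learn's KMeans and DBSCAN as stand-ins for the centroid-based and density-based families, on crescent-shaped data that no fixed-size sphere models well:

```python
# Contrast centroid-based and density-based clustering on crescent-shaped
# data. KMeans and DBSCAN from scikit-learn stand in for the two families.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Centroid-based: forced to slice the two crescents with a linear boundary,
# and every point is assigned to a cluster, noise included.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Density-based: grows clusters outward from dense neighborhoods and
# recovers each crescent; points in sparse regions get the noise label -1.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(db_labels) - {-1})
print(n_clusters)  # number of clusters found, noise excluded
```

DBSCAN recovers the two crescents here precisely because it makes no assumption about cluster shape, only about local density.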

Figure 1: K-means assumes the data can be modeled with fixed-size Gaussian spheres and slices the half-moons apart rather than clustering each one individually. K-means also assigns every point to a cluster, so noise and outliers can skew the resulting centroids.

Figure 2: Density-based algorithms solve this problem by expanding clusters outward from salient neighborhoods in dense regions. DBSCAN and HDBSCAN can treat the remaining points as noise and label them as such, like the purple points in the figure.

HDBSCAN builds on a well-known density-based clustering algorithm, DBSCAN, which does not require the number of clusters to be known in advance but still suffers from the unfortunate drawback of assuming that clusters can be modeled with a single global density threshold. This makes it difficult to model clusters of varying densities. HDBSCAN removes this shortcoming by building a dendrogram with single-linkage clustering, allowing clusters of varying densities to be found. Another well-known density-based clustering method, OPTICS, also improves on DBSCAN by using hierarchical clustering to find clusters of varying densities. OPTICS improves on standard single-linkage clustering by projecting the points into a new space, called reachability space, which moves noise further away from dense regions, making it easier to handle. However, like many other hierarchical agglomerative clustering methods (such as single-linkage and complete-linkage clustering), OPTICS comes with the drawback of cutting the resulting dendrogram at a single global cut value. HDBSCAN is essentially OPTICS + DBSCAN, introducing a measure of cluster stability to cut the dendrogram at varying levels.

We will demonstrate the features currently supported in the RAPIDS cuML implementation of HDBSCAN with a quick example, and will provide some real-world examples and benchmarks of our implementation on the GPU. After reading this blog post, we hope you are excited about the benefits RAPIDS' GPU-accelerated HDBSCAN implementation can bring to your workflows and exploratory data analysis.

Getting Started with HDBSCAN in RAPIDS

RAPIDS provides a set of GPU-accelerated libraries that can nearly be drop-in replacements for many popular libraries in the PyData ecosystem. The example notebook below demonstrates the API compatibility between the most widely used HDBSCAN Python library on CPU and RAPIDS cuML HDBSCAN on GPU (spoiler alert: in many cases it is as simple as changing an import).

Below is a very simple example that demonstrates the benefits of density-based clustering over centroid-based techniques for certain types of data, as well as the benefits of using HDBSCAN over DBSCAN.

Application of HDBSCAN in practice

Density-based clustering techniques are a natural fit for many different clustering tasks because they are able to find clusters of odd shapes and sizes. As with many other general-purpose machine learning algorithms, there is no free lunch, so while HDBSCAN improves on some well-established algorithms, it is still not the best tool for every job. Nonetheless, DBSCAN and HDBSCAN have achieved remarkable success in applications ranging from geospatial analysis and collaborative filtering/recommender systems to financial and scientific computing, and are used in disciplines from astronomy to accelerator physics to genomics. Their robustness to noise also makes them useful for outlier and anomaly detection.

As with many other tools in the data analysis and machine learning ecosystems, computation time has a large impact on production systems and iterative workflows. A faster HDBSCAN means being able to try out more ideas and build better models. Below are a couple of example notebooks using HDBSCAN to cluster word embeddings and single-cell RNA gene expression. They are kept brief and provide a good starting point for applying HDBSCAN to your own datasets. Have you used HDBSCAN successfully in an industrial or scientific field we do not list here? Please leave a comment, as we would love to hear about it. If you run the example notebooks on your own hardware, please also let us know your setup and your experience with RAPIDS.

Word embeddings

Vector embeddings are a popular and widespread machine learning application for clustering. We chose the GoogleNews dataset because it is large enough to give a good indication of how our algorithm scales, yet small enough to run on a single machine. The following notebook demonstrates how to use HDBSCAN to find meaningful topics in high-density regions of the angular space of word embeddings, and visualizes the resulting topic clusters with UMAP. It uses a subset of the full dataset for visualization, but provides a good demonstration of tuning the different hyperparameters and getting familiar with their effect on the resulting clusters. We benchmarked the full dataset (shape 3M x 300) with default hyperparameter settings and stopped the scikit-learn-contrib implementation on CPU after 24 hours. The RAPIDS implementation took approximately 22.8 minutes.
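Clustering in the angular space of embeddings is commonly done with a simple trick (an assumption here, not a detail from the notebook): L2-normalize the vectors so that Euclidean distance becomes a monotone function of cosine distance, letting a Euclidean-metric HDBSCAN approximate angular clustering:

```python
# On the unit sphere, ||u - v||^2 = 2 * (1 - cos(u, v)), so Euclidean
# distance on normalized vectors orders pairs the same way cosine
# distance does.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 300))  # stand-in for word vectors

# L2-normalize each embedding onto the unit sphere.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

u, v = unit[0], unit[1]
euclid_sq = np.sum((u - v) ** 2)
cosine = u @ v
print(np.isclose(euclid_sq, 2 * (1 - cosine)))  # True
```

This is why feeding normalized embeddings to a Euclidean HDBSCAN is a reasonable substitute when an angular metric is not directly exposed.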

Single-cell RNA

Below is an example workflow based on the tutorial notebooks from the Scanpy library. This example notebook is taken from the RAPIDS single-cell examples repository, which also contains several notebooks demonstrating the use of RAPIDS for single-cell and tertiary analysis. On a DGX-1 (40-core Intel Xeon CPU + NVIDIA V100 GPU), we found that running HDBSCAN on the first 50 principal components of a gene expression dataset containing ~70k lung cells took ~1s on the GPU instead of ~29s with multiple CPU threads, a 29x speedup.

The RAPIDS cuML project includes end-to-end GPU-accelerated HDBSCAN and provides both Python and C++ APIs. Like many of the neighborhood-based algorithms in cuML, it leverages the brute-force kNN from Facebook's FAISS to accelerate the construction of the kNN graph in mutual reachability space. This is currently a major bottleneck, and we are investigating ways to improve it further with both exact and approximate nearest neighbors options.
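The mutual reachability space mentioned above has a simple definition: d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)), where core_k(x) is the distance from x to its k-th nearest neighbor. A small numpy/scikit-learn sketch (an illustration, not cuML's implementation):

```python
# Compute the mutual reachability distance matrix for a small point set.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
k = 5

# Core distance: distance to the k-th nearest neighbor (self excluded,
# hence k + 1 neighbors requested).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
core = dists[:, -1]

# Pairwise Euclidean distances, then the mutual reachability transform:
# d_mreach(a, b) = max(core(a), core(b), d(a, b)).
pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
mreach = np.maximum(pair, np.maximum(core[:, None], core[None, :]))

# The transform pushes points in sparse regions further apart but never
# pulls any pair closer together.
print(bool(np.all(mreach >= pair)))  # True
```

Spreading out sparse regions this way is what lets the subsequent single-linkage step separate noise from genuinely dense clusters.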

cuML also includes an implementation of single-linkage hierarchical clustering, which likewise provides C++ and Python APIs. GPU-accelerating the single-linkage algorithm required a new primitive to compute a minimum spanning tree. This primitive is graph-based, so it can be reused across the cuGraph and cuML libraries. Our implementation allows restarts, so we can connect a disconnected kNN graph, and it improves scalability by not having to store the entire pairwise distance matrix in GPU memory.
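The connection between minimum spanning trees and single-linkage clustering is worth making concrete: the sorted MST edge weights of the pairwise distance graph are exactly the single-linkage merge heights. A small SciPy sketch (illustrative, not cuML's code):

```python
# Single-linkage merge heights coincide with the MST edge weights of the
# pairwise distance graph (both follow from Kruskal's algorithm).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))

# Minimum spanning tree of the complete pairwise distance graph.
dist = squareform(pdist(X))
mst = minimum_spanning_tree(dist)            # n - 1 edges
mst_weights = np.sort(mst.data)

# Single-linkage dendrogram; column 2 holds the merge distances.
merges = linkage(pdist(X), method="single")  # n - 1 merges
merge_heights = np.sort(merges[:, 2])

print(bool(np.allclose(mst_weights, merge_heights)))  # True
```

This equivalence is why a fast MST primitive directly yields a fast single-linkage dendrogram, without materializing the full pairwise distance matrix.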

As with most algorithms in cuML, the C++ implementations of these rely on our RAFT library of ML primitives. In the end, we took advantage of the great work Leland McInnes and John Healy did on the reference implementation to GPU-accelerate even the cluster condensation and selection steps, keeping as much data as possible on the GPU, which provides an additional performance boost at the scale of millions of data points.

Benchmarks

We used the benchmark notebook provided by McInnes et al.'s reference implementation on CPU to compare it against cuML's new GPU implementation. The reference implementation is highly optimized for the low-dimensional case, so we compared the higher-dimensional case against its brute-force variant, which makes heavy use of Facebook's FAISS library.

Figure 4: GPU-accelerated HDBSCAN maintains near-interactive performance even for large datasets, while breaking through the limits of CPU parallelism. At these speeds, you'll find you have more time for other things, like getting to know your data better.

Benchmarks were performed on a DGX-1 containing a 40-core Intel Xeon CPU and an NVIDIA 32GB V100 GPU. Even with scaling that is linear in the number of dimensions and quadratic in the number of rows, we observed that the GPU still maintains near-interactive performance, even when the number of rows exceeds 1M.

What's next?

While we have successfully implemented the core of the HDBSCAN algorithm on the GPU, there are still opportunities to further improve its performance, such as speeding up the brute-force kNN graph construction by pruning distance computations, or even using approximate kNN. While Euclidean distance covers the broadest range of uses, we would also like to expose the other distance metrics offered by the scikit-learn-contrib implementation.
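For reference, the kNN-graph construction named above as the bottleneck corresponds on CPU to a step like the following scikit-learn sketch (brute force mirrors what FAISS accelerates; tree-based and approximate methods trade exactness for speed):

```python
# Build a sparse kNN graph with brute-force search; each row stores the
# distances to that point's k nearest neighbors (self excluded).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 16))
k = 10

nn = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(X)
graph = nn.kneighbors_graph(mode="distance")  # CSR, n x n, k entries/row

print(graph.shape, graph.nnz)
```

Swapping `algorithm="brute"` for a tree-based index, or the whole step for an approximate nearest neighbor library, changes only this graph-construction stage; the rest of the HDBSCAN pipeline consumes the graph unchanged.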

The scikit-learn-contrib implementation also contains many nice additional features not covered in the seminal HDBSCAN paper, such as semi-supervised and fuzzy clustering. We also have robust single-linkage and the building blocks for the OPTICS algorithm, which would make good future additions to RAPIDS. Finally, we hope to support sparse inputs in the future.

If you find that one or more of these features would make your application or data analysis project more successful, even if it is not listed here, please head over to our GitHub project and create an issue.

Summary

HDBSCAN is a relatively new density-based clustering algorithm that "stands on the shoulders of giants," improving on the well-known DBSCAN and OPTICS algorithms. Its core primitives also increase reuse and provide building blocks for other algorithms, such as graph-based minimum spanning trees and single-linkage clustering, across RAPIDS cuML and cuGraph.

Like other data modeling algorithms, HDBSCAN is not the perfect tool for all jobs, but it has many practical uses in industrial and scientific computing applications. It can also be used with dimensionality reduction algorithms such as PCA or UMAP, especially in exploratory data analysis applications.