Data scientists in many different fields use clustering methods to find naturally "similar" groups of observations in their datasets. Popular clustering methods include:

Centroid-based: Groups points into k clusters based on their proximity to the nearest of k centroids.

Graph-based: Groups vertices in a graph based on their connections.

Density-based: More flexible grouping based on the density or sparsity of data in nearby areas.

The Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm is a density-based clustering method that is robust to noise, treating points in sparse regions as cluster boundaries and labeling some of them outright as noise. Density-based clustering methods, such as HDBSCAN, are able to find clusters of odd shapes and sizes, as opposed to centroid-based clustering methods such as k-means, k-medoids, or Gaussian mixture models, which find a set of k centroids and model clusters as spheres of fixed shape and size. The performance and simplicity of centroid-based algorithms have helped them remain among the most popular methods for clustering points in high dimensions, but in addition to requiring k to be specified in advance, they cannot change the shape, size, or density of the clusters they model without modifying the input data points.


Figure 1: K-means assumes the data can be modeled with fixed-size Gaussian spheres, and it slices through the moons rather than clustering each one individually. K-means assigns every point to a cluster, even in the presence of noise and outliers that can skew the resulting centroids.


Figure 2: Density-based algorithms solve this problem by expanding clusters outward from prominent neighborhoods in dense regions. DBSCAN and HDBSCAN can treat outlying points as noise and label them as such, like the purple points in the figure.

HDBSCAN builds on a well-known density-based clustering algorithm, DBSCAN, which does not require the number of clusters to be known in advance but still suffers from the unfortunate drawback of assuming that clusters can be modeled with a single global density threshold. This makes it difficult to model clusters of varying densities. HDBSCAN ameliorates this shortcoming by using single-linkage clustering to build a dendrogram, which allows clusters of varying densities to be found. Another well-known density-based clustering method, the OPTICS algorithm, also improves on DBSCAN by using hierarchical clustering to find clusters of varying densities. OPTICS improves upon standard single-linkage clustering by projecting the points into a new space, called reachability space, which moves noise farther away from dense regions and makes it easier to handle. However, like many other hierarchical agglomerative clustering methods (such as single-linkage and complete-linkage clustering), OPTICS suffers from the drawback of cutting the resulting dendrogram with a single global cut value. HDBSCAN is essentially OPTICS+DBSCAN, introducing a measure of cluster stability to cut the dendrogram at varying levels.

We will demonstrate the features currently supported in the RAPIDS cuML implementation of HDBSCAN with a quick example, and will provide some real-world examples and benchmarks of our implementation on the GPU. After reading this blog post, we hope you are excited about the benefits that RAPIDS' GPU-accelerated HDBSCAN implementation can bring to your workflows and exploratory data analysis.

Getting Started with HDBSCAN in RAPIDS

RAPIDS provides a set of GPU-accelerated libraries that can nearly replace many popular libraries in the PyData ecosystem. The example notebook below demonstrates the API compatibility between the most widely used HDBSCAN Python library on the CPU and RAPIDS cuML HDBSCAN on the GPU (spoiler alert – in many cases it is as simple as changing an import).
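As a minimal sketch of that compatibility (the toy data and the min_cluster_size value below are illustrative assumptions, not taken from the notebook):

    # CPU implementation, from the scikit-learn-contrib 'hdbscan' package:
    #   from hdbscan import HDBSCAN
    # GPU implementation from RAPIDS cuML -- often just a change of import:
    from cuml.cluster import HDBSCAN

    import numpy as np

    # Placeholder data; any (n_samples, n_features) float array will do.
    X = np.random.random((1000, 2)).astype(np.float32)

    clusterer = HDBSCAN(min_cluster_size=10)
    labels = clusterer.fit_predict(X)  # label -1 marks points treated as noise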


Below is a very simple example that demonstrates the benefits of density-based clustering over centroid-based techniques for certain types of data, and the benefits of using HDBSCAN over DBSCAN.
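Here is a minimal sketch of that kind of comparison (the toy dataset and all hyperparameters below are illustrative assumptions, not taken from the notebook):

    import numpy as np
    from sklearn.datasets import make_moons
    from cuml.cluster import DBSCAN, HDBSCAN, KMeans

    # Two sparse half-moons plus a small, much denser blob of points.
    moons, _ = make_moons(n_samples=1500, noise=0.08, random_state=0)
    blob = np.random.RandomState(0).normal((2.5, 1.5), 0.05, size=(200, 2))
    X = np.vstack([moons, blob]).astype(np.float32)

    # K-means slices through the moons: it can only model spherical clusters,
    # and it assigns every point, including outliers, to some cluster.
    km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

    # DBSCAN uses a single global density threshold (eps), so one setting
    # struggles to handle the sparse moons and the dense blob at once.
    db_labels = DBSCAN(eps=0.15, min_samples=10).fit_predict(X)

    # HDBSCAN finds clusters of varying density and labels stragglers -1.
    hdb_labels = HDBSCAN(min_cluster_size=30).fit_predict(X)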

Applying HDBSCAN in Practice

Density-based clustering techniques are a natural fit for many different clustering tasks because they are able to find clusters of odd shapes and sizes. As with many other general-purpose machine learning algorithms, there is no free lunch, so while HDBSCAN improves upon some mature algorithms, it is still not always the best tool for every job. Nonetheless, DBSCAN and HDBSCAN have achieved remarkable success in applications ranging from geospatial analysis and collaborative filtering/recommender systems to financial and scientific computing, and they are used in disciplines ranging from astronomy to accelerator physics to genomics. Their robustness to noise also makes them useful for outlier and anomaly detection applications.

As with many other tools in the data analysis and machine learning ecosystem, computation time has a large impact on production systems and iterative workflows. A faster HDBSCAN means being able to try out more ideas and build better models. Below are several example notebooks that use HDBSCAN to cluster word embeddings and single-cell RNA gene expression. They have been trimmed for brevity and provide a good starting point for applying HDBSCAN to your own datasets. Have you successfully used HDBSCAN in an industrial or scientific field we do not list here? Please leave a comment, as we would love to hear about it. If you run the example notebooks on your own hardware, please also let us know your setup and your experience with RAPIDS.

Word Embeddings

Vector embeddings are a popular and very widespread machine learning application for clustering. We chose the GoogleNews dataset because it is large enough to give a good indication of how our algorithm scales, yet small enough to run on a single machine. The notebook below demonstrates how to use HDBSCAN to find meaningful topics in high-density regions of the angular space of word embeddings, and how to visualize the resulting topic clusters with UMAP. It uses a subset of the full dataset for visualization, but it provides a good demonstration of tuning the different hyperparameters and getting familiar with their effect on the resulting clusters. We benchmarked the full dataset (shape 3M x 300) with default hyperparameter settings and stopped the scikit-learn-contrib implementation on the CPU after 24 hours. The RAPIDS implementation took approximately 22.8 minutes.
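A condensed, hedged sketch of that workflow follows (the gensim loading step, the 100k-word subset, and the hyperparameters are illustrative assumptions; the GoogleNews vectors must be downloaded separately):

    import numpy as np
    from gensim.models import KeyedVectors
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    # Load a subset of the 3M x 300 GoogleNews word2vec vectors.
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=100_000)
    X = kv.vectors.astype(np.float32)

    # cuML HDBSCAN currently uses Euclidean distance; L2-normalizing the
    # vectors makes Euclidean distance monotonic in angular distance.
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    labels = HDBSCAN(min_cluster_size=50, min_samples=25).fit_predict(X)

    # Project to 2D with UMAP to visualize the resulting topic clusters.
    coords = UMAP(n_components=2).fit_transform(X)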


Single-Cell RNA

Below is an example workflow based on tutorial notebooks from the Scanpy library. This example notebook is taken from the RAPIDS single-cell examples repository, which also contains several notebooks demonstrating the use of RAPIDS for single-cell and tertiary analysis. On a DGX-1 (40-core Intel Xeon CPU + NVIDIA V100 GPU), we found that HDBSCAN, run on the first 50 principal components of a dataset containing gene expression for ~70k lung cells, achieved a 29x speedup (~1s on the GPU versus ~29s with multiple CPU threads).
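A hedged sketch of that clustering step (the public pbmc3k dataset stands in for the ~70k lung-cell data, and the preprocessing and min_cluster_size choices are illustrative assumptions):

    import scanpy as sc
    from cuml.cluster import HDBSCAN

    # A small public dataset stands in for the ~70k lung-cell data here.
    adata = sc.datasets.pbmc3k()
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=50)  # cluster on the first 50 principal components

    labels = HDBSCAN(min_cluster_size=50).fit_predict(
        adata.obsm["X_pca"].astype("float32"))
    adata.obs["hdbscan"] = labels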

The RAPIDS cuML project includes end-to-end GPU-accelerated HDBSCAN and provides both Python and C++ APIs. Like many neighborhood-based algorithms in cuML, it leverages brute-force kNN from Facebook's FAISS to accelerate the construction of the kNN graph in mutual reachability space. This is currently a major bottleneck, and we are investigating ways to improve it further with both exact and approximate nearest-neighbor options.
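For intuition, here is a small pure-NumPy sketch of the mutual reachability distance that the kNN graph is built in (an illustration under simplifying assumptions, not cuML's actual kernel), where core_k(x) is the distance from x to its k-th nearest neighbor:

    import numpy as np

    def mutual_reachability(X, k=5):
        # All pairwise Euclidean distances (fine for small n; cuML uses
        # FAISS brute-force kNN instead of materializing this matrix).
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        core = np.sort(d, axis=1)[:, k]  # core distance: k-th nearest neighbor
        # d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b))
        return np.maximum(d, np.maximum(core[:, None], core[None, :]))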

cuML also includes an implementation of single-linkage hierarchical clustering, which provides both C++ and Python APIs. GPU-accelerating the single-linkage algorithm required a new primitive for computing the minimum spanning tree. This primitive is graph-based, so it can be reused across the cuGraph and cuML libraries. Our implementation allows restarts, making it possible to connect an otherwise disconnected kNN graph, and it improves scalability by not having to store the entire pairwise distance matrix in GPU memory.
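A brief sketch of calling that single-linkage implementation directly from Python (the random data and n_clusters are illustrative assumptions):

    import numpy as np
    from cuml.cluster import AgglomerativeClustering

    X = np.random.random((1000, 16)).astype(np.float32)

    # cuML's agglomerative clustering exposes the GPU single-linkage primitive.
    agg = AgglomerativeClustering(n_clusters=5, linkage="single")
    agg.fit(X)
    labels = agg.labels_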

As with the C++ behind most cuML algorithms, these rely heavily on our library of ML and graph primitives. Finally, we took advantage of the great work that Leland McInnes and John Healy put into the reference implementation by GPU-accelerating even the cluster condensation and selection steps, keeping as much data as possible on the GPU, which provides an additional performance boost as data sizes scale into the millions.

Benchmarks

We used the benchmark notebook provided with the CPU reference implementation by McInnes et al. to compare it against the new GPU implementation in cuML. The reference implementation is highly optimized for the low-dimensional case, so for higher-dimensional data we compared against its brute-force variant, which makes heavy use of Facebook's FAISS library.

Figure 4: GPU-accelerated HDBSCAN maintains near-interactive performance even for large datasets, where limited parallelism holds the CPU back. With these speeds, you will find you have more time for other things, like better understanding your data.

Benchmarks were performed on a DGX-1 containing a 40-core Intel Xeon CPU and an NVIDIA V100 32GB GPU. Even with linear scaling in the number of dimensions and quadratic scaling in the number of rows, we observe that the GPU maintains near-interactive performance, even as the number of rows climbs past 1M.

What's Next?

While we have successfully implemented the core of the HDBSCAN algorithm on the GPU, there are still opportunities to improve its performance further, such as speeding up the brute-force kNN graph construction by removing unneeded distance computations, or even using an approximate kNN. While Euclidean distance covers the broadest range of uses, we would also like to expose the other distance metrics available in the scikit-learn-contrib implementation.

The scikit-learn-contrib implementation also contains many nice additional features that are not covered in the seminal HDBSCAN paper, such as semi-supervised and fuzzy clustering. We also have robust single-linkage clustering and the building blocks for the OPTICS algorithm, which would make great future additions to RAPIDS. Finally, we hope to support sparse inputs in the future.

If you find that one or more of these features would make your application or data analysis project more successful, even if it is not listed here, please head over to our GitHub project and create an issue.

Summary

HDBSCAN is a relatively new density-based clustering algorithm that "stands on the shoulders of giants", improving upon the well-known DBSCAN and OPTICS algorithms. Its core primitives also increase reuse, providing building blocks, such as the graph-based minimum spanning tree and single-linkage clustering, for other algorithms across the RAPIDS ML and graph libraries.

Like other data modeling algorithms, HDBSCAN is not the perfect tool for all jobs, but it has many practical uses in industrial and scientific computing applications. It can also be used with dimensionality reduction algorithms such as PCA or UMAP, especially in exploratory data analysis applications.

About the Authors

Corey Nolet is a data scientist and senior engineer on the RAPIDS ML team at NVIDIA, where he focuses on building and scaling machine learning algorithms to support extreme data loads at light speed. Prior to working at NVIDIA, Corey spent more than a decade building large-scale exploratory data science and real-time analytics platforms for HPC environments in the defense industry. Corey holds a Bachelor of Science degree in computer science and is still working on his Ph.D. in the same discipline, studying algorithmic acceleration at the intersection of graphs and machine learning. Corey is passionate about using data to better understand the world.

Divye Gala is a senior software engineer on the RAPIDS team at NVIDIA, developing fast and scalable algorithms for machine learning and graph analytics on GPUs. Divye holds M.S. and B.S. degrees in computer science. In his free time, Divye enjoys playing football and watching cricket.

