Data scientists across many fields use clustering methods to find naturally "similar" groups of observations in their datasets. Popular clustering methods fall into a few families:
Centroid-based: Groups points into k groups based on their proximity to a certain centroid.
Graph-based: Groups vertices in the graph based on their connections.
Density-based: More flexible grouping based on the density or sparsity of data in nearby areas.
The Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm is a density-based clustering method that is robust to noise: it treats points in sparse regions as cluster boundaries and labels some of them outright as noise. Density-based clustering methods, such as HDBSCAN, are able to find clusters of odd shapes and sizes—as opposed to centroid-based clustering methods such as k-means, k-medoids, or Gaussian mixture models, which find a set of k centroids that model clusters as balls of fixed shape and size. While the speed and simplicity of centroid-based algorithms help them remain among the most popular methods for clustering points in high dimensions, they require the number of clusters k to be specified in advance and, without modifying the input data points, they cannot model the varying shapes, sizes, or densities of clusters.
Figure 1: K-means assumes the data can be modeled with fixed-size Gaussian balls and slices the moons apart rather than clustering each moon separately. K-means assigns each point to a cluster, even in the presence of noise and outliers, which can affect the resulting centroids.
Figure 2: Density-based algorithms solve this problem by expanding clusters out from salient neighborhoods in dense regions. DBSCAN and HDBSCAN can identify these outlying points and label them as noise, like the purple points in the figure.
HDBSCAN builds on a well-known density-based clustering algorithm, DBSCAN, which does not require the number of clusters to be known in advance but still suffers from the unfortunate shortcoming of assuming that clusters can be modeled with a single global density threshold. This makes it difficult to model clusters with varying densities. HDBSCAN removes this shortcoming by using single-linkage clustering to build a dendrogram, allowing clusters of varying densities to be found. Another well-known density-based clustering method, OPTICS, also improves upon DBSCAN and uses hierarchical clustering to find clusters with varying densities. OPTICS improves upon standard single-linkage clustering by projecting the points into a new space, called reachability space, which moves noise further away from dense regions and makes it more tractable. However, like many other hierarchical agglomerative clustering methods (such as single-linkage and complete-linkage clustering), OPTICS suffers from the shortcoming of cutting the resulting dendrogram with a single global cut value. HDBSCAN is essentially OPTICS+DBSCAN, introducing a measure of cluster stability to cut the dendrogram at varying levels.
We will demonstrate the features currently supported in the RAPIDS cuML implementation of HDBSCAN with a quick example, and will provide some real-world examples and benchmarks of our implementation on the GPU. After reading this blog post, we hope you are excited about the benefits that RAPIDS' GPU-accelerated HDBSCAN implementation can bring to your workflows and exploratory data analysis process.
Getting Started with HDBSCAN in RAPIDS
RAPIDS provides a suite of GPU-accelerated libraries that can serve as near drop-in replacements for many popular libraries in the PyData ecosystem. The example notebook below demonstrates the API compatibility between the most widely used HDBSCAN Python library on the CPU and RAPIDS cuML HDBSCAN on the GPU (spoiler alert – in many cases it is as simple as changing an import).
Basic Usage
Here is an example of training an HDBSCAN model using the hdbscan Python package from scikit-learn-contrib:
from sklearn import datasets
from hdbscan import HDBSCAN

X, y = datasets.make_moons(n_samples=50, noise=0.05)

model = HDBSCAN(min_samples=5)
y_hat = model.fit_predict(X)
And here is the same code using the GPU-accelerated HDBSCAN in cuML (spoiler alert: the only difference is the import).
from sklearn import datasets
from cuml.cluster import HDBSCAN

X, y = datasets.make_moons(n_samples=50, noise=0.05)

model = HDBSCAN(min_samples=5, gen_min_span_tree=True)
y_hat = model.fit_predict(X)
Plotting
We can plot the minimum spanning tree the same way we would for the original HDBSCAN implementation:
model.minimum_spanning_tree_.plot()
The single linkage dendrogram and condensed tree can be plotted as well:
model.single_linkage_tree_.plot()
model.condensed_tree_.plot()
Below is a very simple example that demonstrates the benefits of density-based clustering over centroid-based techniques for certain types of data, and the benefits of using HDBSCAN over DBSCAN.
Clustering Algorithm Comparison
Example notebook showing the strengths of density-based clustering techniques DBSCAN & HDBSCAN on datasets with odd and interleaved shapes.
Generate the data
from sklearn import datasets
import numpy as np
import warnings
warnings.filterwarnings("ignore")
X, y = datasets.make_moons(n_samples=1000, noise=0.12)
import matplotlib.pyplot as plt
plt.scatter(X[:,0], X[:,1], c=y, s=0.5)
K-Means
Even when we know there are only 2 clusters in this dataset, we can see that k-means just separates the space into two evenly-sized regions. It’s not possible to model the interleaved and odd cluster shapes without performing some heavy pre-processing to project the points into a space where they can be modeled with fixed-sized balls.
from cuml.cluster import KMeans
kmeans_labels_ = KMeans(n_clusters=2).fit_predict(X)
plt.title("Interleaved Moons w/ K-Means")plt.xticks([])plt.yticks([])plt.scatter(X[:,0], X[:,1], c=kmeans_labels_, s=1.0)
DBSCAN
While this particular example dataset is a great demonstration of how DBSCAN can find clusters of odd shapes and similar densities, the eps parameter sets the radius of a ball around each point, where points inside the ball are considered part of its neighborhood. This neighborhood is called an epsilon neighborhood, and it can be pretty non-intuitive to tune in practice. In addition, the same eps setting is assumed to apply globally to all the data points. A little more intuitive is the min_samples parameter, which determines how many points need to fall within the eps ball for a point to be considered a core point, upon which its neighborhood will be expanded to form a cluster.
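To make the eps neighborhood and core point ideas concrete, here is a minimal NumPy sketch; the toy points and the eps and min_samples values are arbitrary choices for illustration, not part of the example below.

import numpy as np

# Toy 2-D points; eps and min_samples are chosen only for illustration.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
eps, min_samples = 0.5, 3

# Pairwise Euclidean distances.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# A point is a "core point" if its eps-ball contains at least min_samples
# points (counting itself); DBSCAN grows clusters outward from core points.
neighborhood_sizes = (dists <= eps).sum(axis=1)
core_mask = neighborhood_sizes >= min_samples
print(core_mask)  # -> [ True  True  True False]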
from cuml.cluster import DBSCAN
As we can see, the default parameter settings of DBSCAN (and other density-based algorithms) make assumptions about the data that won't often lead to good clustering results.
dbscan_labels_ = DBSCAN().fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=dbscan_labels_, s=1.0)
We can set the min_samples parameter to 10, but the amount of noise is too high for this alone to result in good cluster assignments.
dbscan_labels_ = DBSCAN(min_samples=10).fit_predict(X)
import numpy as np
np.unique(dbscan_labels_)
array([0], dtype=int32)
plt.scatter(X[:,0], X[:,1], c=dbscan_labels_, s=0.5)
We can introduce a setting for the eps parameter, but it's unclear how to determine a good value just from this visualization. Setting this value too small can yield too many small clusters that are not representative of the larger patterns in the data.
dbscan_labels_ = DBSCAN(min_samples=10, eps=0.08).fit_predict(X)
np.unique(dbscan_labels_)
array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=int32)
Increasing eps to 0.12 did the trick, and we notice it found the points which surround the dense regions as noise (these will have a label of -1).
dbscan_labels_ = DBSCAN(min_samples=10, eps=0.12).fit_predict(X)
plt.title("Interleaved Moons w/ DBSCAN")
plt.xticks([])
plt.yticks([])
plt.scatter(X[:,0], X[:,1], c=dbscan_labels_, s=0.5)
HDBSCAN
HDBSCAN removes the need to specify a global epsilon neighborhood radius around each point (eps) and instead uses the min_points argument to compute each point's radius to its k-th nearest neighbor, using that radius to push sparser regions further away from each other. Since HDBSCAN is built upon single-linkage clustering, an agglomerative technique, each point starts as its own cluster and clusters are merged into a tree, bottom-up, until a single root cluster is reached. A min_cluster_size argument allows us to specify a minimum threshold for when clusters in the agglomerated tree should be merged. The dendrogram is cut at varying levels, or selected, using a measure of cluster stability based on the distances between the points in each cluster and the clusters from which they originated. Clusters which are selected consume all of their descendant clusters. HDBSCAN provides an additional option, cluster_selection_epsilon, to set the minimum distance threshold from which clusters will be split up into smaller clusters.
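As a hedged illustration of how these options might be combined in cuML (the parameter values below are arbitrary and not part of the example that follows):

from cuml.cluster import HDBSCAN

# Illustrative settings only: require at least 15 points per cluster and
# avoid splitting clusters whose members lie within 0.1 of each other.
model = HDBSCAN(min_samples=10,
                min_cluster_size=15,
                cluster_selection_epsilon=0.1)
labels = model.fit_predict(X)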
from cuml.cluster import HDBSCAN
labels_ = HDBSCAN(min_samples=10).fit_predict(X)
np.unique(labels_)
array([-1, 0, 1], dtype=int32)
plt.scatter(X[:,0], X[:,1], c=labels_, s=0.5)
Running the same clustering against the reference Python HDBSCAN implementation:
from hdbscan import HDBSCAN as HDBSCANref
labels_ = HDBSCANref(min_samples=25).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels_, s=0.5)
Applications of HDBSCAN in Practice
Density-based clustering techniques are a natural fit for many different clustering tasks because they are able to find clusters of odd shapes and sizes. As with many other general-purpose machine learning algorithms, there is no free lunch, so while HDBSCAN improves upon some well-established algorithms, it is still not always the best tool for the job. Nonetheless, DBSCAN and HDBSCAN have found remarkable success in applications ranging from geospatial analysis and collaborative filtering/recommender systems to financial and scientific computing, and they are used in disciplines ranging from astronomy to accelerator physics to genomics. Their robustness to noise also makes them useful for outlier and anomaly detection.
As with many other tools in the data analysis and machine learning ecosystems, computation time has a large impact on production systems and iterative workflows. A faster HDBSCAN means being able to try out more ideas and build better models. Below are a couple of example notebooks that use HDBSCAN to cluster word embeddings and single-cell RNA gene expression. They are kept brief and provide a good starting point for using HDBSCAN on your own datasets. Have you used HDBSCAN successfully in an industrial or scientific field that we do not list here? Please leave a comment, as we would love to hear about it. If you run the example notebooks on your own hardware, please also let us know about your setup and your experience with RAPIDS.
Word Embeddings
Clustering vector embeddings is a popular and very widespread machine learning application. We chose the GoogleNews dataset because it is large enough to give a good indication of how our algorithm scales, while small enough to process on a single machine. The following notebook demonstrates how to use HDBSCAN to find meaningful topics in high-density regions of the angular space of word embeddings, and visualizes the resulting topic clusters using UMAP. It uses a subset of the full dataset for visualization, but provides a good demonstration of tuning the different hyperparameters and getting familiar with their effect on the resulting clusters. We benchmarked the full dataset (shape 3M x 300) with default hyperparameter settings and stopped the scikit-learn-contrib implementation on the CPU after 24 hours. The RAPIDS implementation took approximately 22.8 minutes.
Clustering and Visualizing Embeddings with HDBSCAN & UMAP
In this notebook, we use a dataset of word embeddings to extract areas that could be associated with meaningful topics. We use HDBSCAN to find the meaningful topics since it can account for noise when labeling regions of high density without explicitly requiring a distance as input. We also use UMAP to visualize the resulting topic clusters.
This notebook requires the GoogleNews Word2Vec dataset, which can be downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing.
from gensim import models
import cupy as cp

from cuml.manifold import UMAP
from cuml.preprocessing import normalize
from cuml.cluster import HDBSCAN

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
w = models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
We can use min_samples to control the radius used for the core distances. A smaller setting leads to more clusters and fewer points being treated as noise. min_cluster_size and max_cluster_size bound the number of clusters by only allowing points to form a cluster when there are more than min_cluster_size of them, and by splitting a single cluster into multiple smaller clusters when it exceeds max_cluster_size.
n_points = 100000

umap_n_neighbors=100
umap_min_dist=1e-3
umap_spread=2.0
umap_n_epochs=500
umap_random_state=42

hdbscan_min_samples=25
hdbscan_min_cluster_size=5
hdbscan_max_cluster_size=1000
hdbscan_cluster_selection_method="leaf"
L2 normalize the vectors so that the Euclidean distance between them will give the angular distance.
%%time
X = normalize(cp.array(w.vectors))[:n_points]

CPU times: user 4.17 s, sys: 4.57 s, total: 8.73 s
Wall time: 8.76 s
X.shape
(100000, 300)
First, we can use manifold learning to convert the neighborhoods of points in the angular distance space to a 2-dimensional Euclidean space so that we can better visualize the clusters.
%%time
umap = UMAP(n_neighbors=umap_n_neighbors,
            min_dist=umap_min_dist,
            spread=umap_spread,
            n_epochs=umap_n_epochs,
            random_state=umap_random_state)
embeddings = umap.fit_transform(X)

CPU times: user 1.71 s, sys: 910 ms, total: 2.62 s
Wall time: 2.62 s
%%time
hdbscan = HDBSCAN(min_samples=hdbscan_min_samples,
                  min_cluster_size=hdbscan_min_cluster_size,
                  max_cluster_size=hdbscan_max_cluster_size,
                  cluster_selection_method=hdbscan_cluster_selection_method)
labels = hdbscan.fit_predict(X)

Label prop iterations: 23
Label prop iterations: 10
Label prop iterations: 6
Label prop iterations: 5
Iterations: 4
17663,114,1077,12,218,1603
CPU times: user 5.6 s, sys: 933 ms, total: 6.53 s
Wall time: 6.51 s
Percentage of points that were labeled as belonging to meaningful topics:
len(labels[labels!=-1]) / embeddings.shape[0]
0.04558
How many labels are there?
len(cp.unique(labels))
151
x = embeddings[:,0]
y = embeddings[:,1]

x = x[labels>-1]
y = y[labels>-1]
labels_nonoise = labels[labels>-1]

x_noise = embeddings[:,0]
y_noise = embeddings[:,1]
x_noise = x_noise[labels==-1]
y_noise = y_noise[labels==-1]
We observe that the regions of high density, where points are gathered closely together, have filled out nicely into clusters of various shapes and sizes. We interpret points within their local neighborhoods in the following plot as being more similar to each other than to points in disconnected neighborhoods. It's not uncommon for UMAP to split neighborhoods into multiple clusters even though HDBSCAN labeled them as a single cluster. This just implies that the neighborhoods of the points which were split are more similar to each other internally than they are across their separated components.
figure(figsize=(10, 7), dpi=80)
plt.scatter(x.get(), y.get(), c=labels_nonoise.get(), s=0.1, cmap='gist_ncar')
We can also visualize the noise
figure(figsize=(10, 7), dpi=80)
plt.scatter(x_noise.get(), y_noise.get(), s=0.001, cmap="gray", alpha=0.4)
label_dist = cp.histogram(labels, bins=cp.unique(labels))
plt.plot(label_dist[1][:len(label_dist[0])-1].get(), label_dist[0][1:].get())
Let’s look at a few groups of terms that clustered together
for c in range(0, 10):
    print("Cluster: %s, %s" % (c, [w.index_to_key[i] for i in cp.argwhere(labels==c)[:10,0].get()]))
Cluster: 0, ['##th', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', '###th', '##st', '##nd']
Cluster: 1, ['miles', 'north', 'south', 'west', 'east', 'kilometers', 'km', 'northwest', 'northeast', 'southwest']
Cluster: 2, ['petitioner', 'Defendant', 'Plaintiff', 'Appellant', 'Petitioner', 'appellants', 'Id.']
Cluster: 3, ['memorial', 'cemetery', 'burial', 'graves', 'remembrance']
Cluster: 4, ['FPL', 'AEP', 'Progress_Energy', 'TXU', 'FirstEnergy', 'Ameren', 'ComEd', 'NIPSCO']
Cluster: 5, ['Officials', 'Students', 'Residents', 'Customers', 'Businesses', 'Activists']
Cluster: 6, ['gay', 'gay_marriage', 'gays', 'lesbian', 'homosexual', 'LGBT', 'lesbians', 'homosexuals']
Cluster: 7, ['borrowings', 'Common_Stock', 'Common_Shares', 'Senior_Notes', 'senior_unsecured', 'debentures', 'convertible_notes', 'Preferred_Stock', 'Debentures', 'convertible_debentures']
Cluster: 8, ['-#.#', '-1', '-2', '-3', '-4']
Cluster: 9, ['Supreme_Court', 'justices', 'Alito', 'appellate_court', 'Samuel_Alito']
We can also order them according to their probabilities (notice most probabilities are fairly large because the regions are already pretty dense)
for c in range(0, 25):
    indices = list(cp.argwhere(labels==c).get()[:,0])
    indices = sorted(indices, key=lambda x: 1 - hdbscan.probabilities_[x])
    print("Cluster: %s, %s" % (c, [w.index_to_key[i] for i in indices[:10]]))
Cluster: 0, ['##th', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', '###th', '##st', '##nd']
Cluster: 1, ['miles', 'north', 'south', 'west', 'east', 'kilometers', 'km', 'northwest', 'northeast', 'southwest']
Cluster: 2, ['petitioner', 'Defendant', 'Plaintiff', 'Appellant', 'Petitioner', 'appellants', 'Id.']
Cluster: 3, ['memorial', 'cemetery', 'burial', 'graves', 'remembrance']
Cluster: 4, ['FPL', 'AEP', 'Progress_Energy', 'TXU', 'FirstEnergy', 'Ameren', 'ComEd', 'NIPSCO']
Cluster: 5, ['Officials', 'Students', 'Residents', 'Customers', 'Businesses', 'Activists']
Cluster: 6, ['gay', 'gay_marriage', 'gays', 'lesbian', 'homosexual', 'LGBT', 'lesbians', 'homosexuals']
Cluster: 7, ['borrowings', 'Common_Stock', 'Common_Shares', 'Senior_Notes', 'senior_unsecured', 'debentures', 'convertible_notes', 'Preferred_Stock', 'Debentures', 'convertible_debentures']
Cluster: 8, ['-#.#', '-1', '-2', '-3', '-4']
Cluster: 9, ['Supreme_Court', 'justices', 'Alito', 'appellate_court', 'Samuel_Alito']
Cluster: 10, ['Citigroup', 'Goldman_Sachs', 'Morgan_Stanley', 'UBS', 'Merrill_Lynch', 'Deutsche_Bank', 'HSBC', 'Credit_Suisse', 'JPMorgan', 'JP_Morgan']
Cluster: 11, ['CEO', 'chairman', 'vice_president', 'Director', 'Vice_President', 'managing_director', 'Executive_Director', 'Managing_Director', 'Senior_Vice', 'Executive_Vice']
Cluster: 12, ['Wal_Mart', 'Walmart', 'Kmart', 'Kroger', 'Costco', 'Nordstrom', 'JC_Penney']
Cluster: 13, ['drilling', 'mineralization', 'intersected', 'g_t_Au', 'drill_holes', 'g_t_gold', 'g_t', 'mineralized', 'molybdenum', 'gold_mineralization']
Cluster: 14, ['JB', 'JM', 'ML', 'JS', 'JL', 'JH']
Cluster: 15, ['Warner_Bros.', 'Lionsgate', 'DreamWorks', 'Paramount_Pictures', 'Sony_Pictures']
Cluster: 16, ['artist', 'painting', 'paintings', 'painter', 'sculptures', 'artworks', 'watercolors', 'Paintings', 'canvases']
Cluster: 17, ['loans', 'mortgage', 'lenders', 'mortgages', 'borrowers', 'foreclosures', 'subprime', 'mortgage_lenders']
Cluster: 18, ['plane', 'aircraft', 'planes', 'airplane', 'jets', 'Airbus_A###', 'aircrafts']
Cluster: 19, ['apartments', 'condo', 'condominium', 'condos', 'Apartments', 'condominiums', 'townhouse', 'townhomes', 'Condominiums', 'townhome']
Cluster: 20, ['fell', 'jumped', 'climbed', 'slipped', 'risen', 'surged', 'plunged', 'slid', 'soared', 'dipped']
Cluster: 21, ['school', 'students', 'teachers', 'elementary', 'Elementary', 'kindergarten', 'preschool', 'elementary_schools', 'eighth_grade', 'eighth_graders']
Cluster: 22, ['professor', 'Professor', 'scientist', 'researcher', 'Ph.D.', 'associate_professor', 'master_degree', 'PhD', 'assistant_professor', 'bachelor_degree']
Cluster: 23, ['gun', 'firearm', 'rifle', 'handgun', 'pistol', 'pistols', 'assault_rifle', '.##_caliber', '.##_caliber_handgun', 'gauge_shotgun']
Cluster: 24, ['#.####', 'yen', 'greenback', 'EUR_USD', 'downtrend', 'USD_JPY', '#.####/##', 'GBP_USD', 'EURUSD', 'Japanese_Yen']
Single-Cell RNA
Below is an example workflow based on the tutorial notebooks from the Scanpy library. This example notebook is taken from the RAPIDS single-cell examples repository, which also contains several notebooks that demonstrate the use of RAPIDS for single-cell and tertiary analysis. On a DGX-1 (40-core Intel Xeon CPU + NVIDIA V100 GPU), we found that HDBSCAN took ~1s on the GPU, compared to ~29s with multiple CPU threads, on the first 50 principal components of a gene expression dataset containing ~70k lung cells, a 29x speedup.
RAPIDS & Scanpy Single-Cell RNA-seq Workflow
Copyright (c) 2020, NVIDIA CORPORATION.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This notebook demonstrates a single-cell RNA analysis workflow that begins with preprocessing a count matrix of size (n_gene, n_cell)
and results in a visualization of the clustered cells for further analysis.
For demonstration purposes, we use a dataset of ~70,000 human lung cells from Travaglini et al. 2020 (https://www.biorxiv.org/content/10.1101/742320v2) and label cells using the ACE2 and TMPRSS2 genes. See the README for instructions to download this dataset.
Import requirements
import numpy as np
import scanpy as sc
import anndata
import time
import os

import cudf
import cupy as cp

from cuml.decomposition import PCA
from cuml.manifold import TSNE
from cuml.cluster import KMeans

import rapids_scanpy_funcs

import warnings
warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
We use the RAPIDS memory manager on the GPU to control how memory is allocated.
import rmm

rmm.reinitialize(
    managed_memory=True,   # Allows oversubscription
    pool_allocator=False,  # default is False
    devices=0,             # GPU device IDs to register. By default registers only GPU 0.
)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
Input data
In the cell below, we provide the path to the .h5ad file containing the count matrix to analyze. Please see the README for instructions on how to download the dataset we use here.
We recommend saving count matrices in the sparse .h5ad format, as it is much faster to load than a dense CSV file. To run this notebook using your own dataset, please see the README for instructions to convert your own count matrix into this format. Then, replace the path in the cell below with the path to your generated .h5ad file.
input_file = "../data/krasnow_hlca_10x.sparse.h5ad"if not os.path.exists(input_file): print('Downloading import file...') os.makedirs('../data', exist_ok=True) wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/krasnow_hlca_10x.sparse.h5ad', input_file)
Set parameters
# marker genes
RIBO_GENE_PREFIX = "RPS"                # Prefix for ribosomal genes to regress out
markers = ["ACE2", "TMPRSS2", "EPCAM"]  # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200   # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000  # Filter out cells with more genes than this expressed

# filtering genes
n_top_genes = 5000  # Number of highly variable genes to retain

# PCA
n_components = 50  # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20  # Number of principal components to use for t-SNE

# k-means
k = 35  # Number of clusters for k-means

# KNN
n_neighbors = 15  # Number of nearest neighbors for KNN graph
knn_n_pcs = 50    # Number of principal components to use for finding nearest neighbors

# UMAP
umap_min_dist = 0.3
umap_spread = 1.0

# Gene ranking
ranking_n_top_genes = 50  # Number of differential genes to compute for each cluster
start = time.time()
Load and Prepare Data
We load the sparse count matrix from an .h5ad file using Scanpy. The sparse count matrix will then be placed on the GPU.
data_load_start = time.time()
%%time
adata = sc.read(input_file)

CPU times: user 125 ms, sys: 228 ms, total: 354 ms
Wall time: 353 ms
adata.X.shape
(65662, 26485)
a = np.diff(adata.X.indptr)
a
array([1347, 1713, 1185, ..., 651, 1050, 2218], dtype=int32)
len(a[a < 3000])
52985
We maintain the index of unique genes in our dataset:
%%time
genes = cudf.Series(adata.var_names)
sparse_gpu_array = cp.sparse.csr_matrix(adata.X)

CPU times: user 836 ms, sys: 586 ms, total: 1.42 s
Wall time: 1.45 s
Verify the shape of the resulting sparse matrix:
sparse_gpu_array.shape
(65662, 26485)
And the number of non-zero values in the matrix:
sparse_gpu_array.nnz
126510394
data_load_time = time.time()
print("Total data load and format time: %s" % (data_load_time-data_load_start))
Total data load and format time: 1.8720729351043701
Preprocessing
preprocess_start = time.time()
Filter
We filter the count matrix to remove cells with an extreme number of genes expressed.
%%time
sparse_gpu_array = rapids_scanpy_funcs.filter_cells(sparse_gpu_array, min_genes=min_genes_per_cell, max_genes=max_genes_per_cell)

CPU times: user 535 ms, sys: 304 ms, total: 839 ms
Wall time: 838 ms
Some genes will now have zero expression in all cells. We filter out such genes.
%%time
sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, genes, min_cells=1)

CPU times: user 1.03 s, sys: 162 ms, total: 1.19 s
Wall time: 1.19 s
The size of our count matrix is now reduced.
sparse_gpu_array.shape
(65462, 22058)
Normalize
We normalize the count matrix so that the total counts in each cell sum to 1e4.
%%time
sparse_gpu_array = rapids_scanpy_funcs.normalize_total(sparse_gpu_array, target_sum=1e4)

CPU times: user 415 µs, sys: 599 µs, total: 1.01 ms
Wall time: 747 µs
Next, we log transform the count matrix.
%%time
sparse_gpu_array = sparse_gpu_array.log1p()

CPU times: user 42.7 ms, sys: 52.4 ms, total: 95.1 ms
Wall time: 94.4 ms
Select Most Variable Genes
We will now select the most variable genes in the dataset. However, we first save the ‘raw’ expression values of the ACE2 and TMPRSS2 genes to use for labeling cells afterward. We will also store the expression of an epithelial marker gene (EPCAM).
%%time
tmp_norm = sparse_gpu_array.tocsc()
marker_genes_raw = {
    ("%s_raw" % marker): tmp_norm[:, genes[genes == marker].index[0]].todense().ravel()
    for marker in markers
}

del tmp_norm

CPU times: user 208 ms, sys: 175 ms, total: 383 ms
Wall time: 383 ms
Now, we convert the count matrix to an annData object.
%%time
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = genes.to_pandas()

CPU times: user 178 ms, sys: 52.9 ms, total: 231 ms
Wall time: 229 ms
Using scanpy, we filter the count matrix to retain only the 5000 most variable genes.
%%time
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")
adata = adata[:, adata.var.highly_variable]

CPU times: user 977 ms, sys: 26.5 ms, total: 1 s
Wall time: 1 s
Regress out confounding factors (number of counts, ribosomal gene expression)
We can now perform regression on the count matrix to correct for confounding factors – for example purposes, we use the number of counts and the expression of ribosomal genes. Many workflows use the expression of mitochondrial genes (named starting with MT-).
ribo_genes = adata.var_names.str.startswith(RIBO_GENE_PREFIX)
We now calculate the total counts and the percentage of ribosomal counts for each cell.
%%time
n_counts = adata.X.sum(axis=1)
percent_ribo = (adata.X[:,ribo_genes].sum(axis=1) / n_counts).ravel()

n_counts = cp.array(n_counts).ravel()
percent_ribo = cp.array(percent_ribo).ravel()

CPU times: user 881 ms, sys: 32.3 ms, total: 914 ms
Wall time: 913 ms
And perform regression:
%%time
sparse_gpu_array = cp.sparse.csc_matrix(adata.X)

CPU times: user 653 ms, sys: 114 ms, total: 767 ms
Wall time: 766 ms
%%time
sparse_gpu_array = rapids_scanpy_funcs.regress_out(sparse_gpu_array, n_counts, percent_ribo)

CPU times: user 31.8 s, sys: 10.5 s, total: 42.3 s
Wall time: 43.2 s
Scale
Finally, we scale the count matrix to obtain a z-score and apply a cutoff value of 10 standard deviations, obtaining the preprocessed count matrix.
%%time
sparse_gpu_array = rapids_scanpy_funcs.scale(sparse_gpu_array, max_value=10)

CPU times: user 119 ms, sys: 169 ms, total: 289 ms
Wall time: 287 ms
preprocess_time = time.time()
print("Total Preprocessing time: %s" % (preprocess_time-preprocess_start))
Total Preprocessing time: 48.95587778091431
Cluster & Visualize
We store the preprocessed count matrix as an AnnData object, which is currently in host memory. We also add the expression levels of the marker genes as observations to the annData object.
%%time
genes = adata.var_names
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = genes

for name, data in marker_genes_raw.items():
    adata.obs[name] = data.get()

CPU times: user 199 ms, sys: 92.1 ms, total: 292 ms
Wall time: 291 ms
Reduce
We use PCA to reduce the dimensionality of the matrix to its top 50 principal components.
%%timeneedle.obsm["X_pca"] = PCA(n_components=n_components, output_type="numpy").fit_transform(needle.X)
CPU times: user 1.14 s, sys: 1.41 s, total: 2.55 sWall time: 2.55 s
UMAP + Density clustering
We can also visualize the cells using the UMAP algorithm in RAPIDS. Before UMAP, we need to construct a k-nearest neighbors graph in which each cell is connected to its nearest neighbors. This can be done conveniently using RAPIDS functionality already integrated into Scanpy.
Note that Scanpy uses an approximation to the nearest neighbors on the CPU while the GPU version performs an exact search. While both methods are known to yield useful results, some differences in the resulting visualization and clusters can be observed.
%%time
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=knn_n_pcs, method='rapids')

CPU times: user 4.04 s, sys: 276 ms, total: 4.32 s
Wall time: 4.3 s
The UMAP function from RAPIDS is also integrated into Scanpy.
%%time
sc.tl.umap(adata, min_dist=umap_min_dist, spread=umap_spread, method='rapids')

WARNING: .obsp["connectivities"] have not been computed using umap
CPU times: user 246 ms, sys: 229 ms, total: 475 ms
Wall time: 474 ms
Next, we cluster the cells with GPU-accelerated HDBSCAN from cuML, using the principal components of the count matrix as input.
%%time
import pandas as pd
from cuml.cluster import HDBSCAN

hdbscan = HDBSCAN(min_samples=5, min_cluster_size=30)
adata.obs['hdbscan_gpu'] = pd.Categorical(pd.Series(hdbscan.fit_predict(adata.obsm['X_pca'])))

CPU times: user 718 ms, sys: 465 ms, total: 1.18 s
Wall time: 1.19 s
%%time
import pandas as pd
from hdbscan import HDBSCAN as refHDBSCAN

hdbscan = refHDBSCAN(min_samples=5, min_cluster_size=30, core_dist_n_jobs=-1)
adata.obs['hdbscan_cpu'] = pd.Categorical(pd.Series(hdbscan.fit_predict(adata.obsm['X_pca'])))

CPU times: user 23.5 s, sys: 1.2 s, total: 24.7 s
Wall time: 29.6 s
We plot the cells using the UMAP visualization, colored by the HDBSCAN cluster labels.
%%time
sc.pl.umap(adata, color=["hdbscan_gpu"])

CPU times: user 597 ms, sys: 12.2 ms, total: 609 ms
Wall time: 607 ms
The RAPIDS cuML project includes an end-to-end GPU-accelerated HDBSCAN and provides both Python and C++ APIs. Like many of the neighborhood-based algorithms in cuML, it leverages the brute-force kNN from Facebook's FAISS library to accelerate the construction of the kNN graph in mutual reachability space. This is currently a major bottleneck, and we are investigating ways to improve it further with both exact and approximate nearest neighbors options.
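Purely as an illustration of the mutual reachability transform mentioned above (a dense NumPy sketch, not the cuML implementation), the distance between two points can be rewritten in terms of their core distances like this:

import numpy as np

def mutual_reachability(dists, k):
    """Toy dense sketch of the mutual reachability transform.

    dists: (n, n) pairwise distance matrix
    k: number of neighbors used for the core distance
    """
    # Core distance: distance to the k-th nearest neighbor (self sits at index 0).
    core_dist = np.sort(dists, axis=1)[:, k]
    # Mutual reachability distance pushes points in sparse regions further apart:
    # max(core_dist[i], core_dist[j], d(i, j)).
    return np.maximum(np.maximum.outer(core_dist, core_dist), dists)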
cuML also includes an implementation of single-linkage hierarchical clustering, with both C++ and Python APIs. GPU-accelerating the single-linkage algorithm required a new primitive to compute a minimum spanning tree. This primitive is graph-based, so it can be reused across the cugraph and cuml libraries. Our implementation allows restarts, so a disconnected knn graph can be connected, and it improves scalability by not having to store the entire pairwise distance matrix in GPU memory.
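Again only as a CPU sketch (not cuML's GPU primitive), the single-linkage merge order can be recovered by computing a minimum spanning tree over the mutual reachability graph, for example with SciPy; the point set and neighbor count below are arbitrary:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
points = rng.random((200, 2))

# Pairwise distances and mutual reachability (see the sketch above).
dists = squareform(pdist(points))
core_dist = np.sort(dists, axis=1)[:, 5]  # distance to the 5th nearest neighbor
mreach = np.maximum(np.maximum.outer(core_dist, core_dist), dists)

# MST over the mutual reachability graph; sorting its edges by weight gives
# the merge order of single-linkage clustering (i.e., the dendrogram HDBSCAN condenses).
mst = minimum_spanning_tree(mreach).tocoo()
order = np.argsort(mst.data)
edges = np.column_stack([mst.row[order], mst.col[order], mst.data[order]])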
As with most cuML algorithms in C++, these rely on our library of ML and graph-based primitives [VZX27]. They also build on the great work of Leland McInnes and John Healy, GPU-accelerating even the cluster condensing and selection steps, keeping as much data as possible on the GPU, and providing an additional performance boost as the data scales into the millions.
Benchmarks
We used the benchmark notebook provided with the reference CPU implementation by McInnes et al. to compare it against the new GPU implementation in cuML. The reference implementation is highly optimized for the low-dimensional case, so for the higher-dimensional case we compared against a brute-force variant that makes heavy use of Facebook's FAISS library.
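If you want to run a rough comparison on your own hardware, a minimal timing harness along these lines is a reasonable starting point; the dataset shape and parameters below are placeholders, not the benchmark configuration used here:

import time
import numpy as np
from cuml.cluster import HDBSCAN as cumlHDBSCAN
from hdbscan import HDBSCAN as cpuHDBSCAN

# Placeholder data; adjust the size for your hardware and patience.
X = np.random.random((20_000, 50)).astype(np.float32)

start = time.time()
cumlHDBSCAN(min_samples=25, min_cluster_size=5).fit(X)
print("GPU HDBSCAN: %.2fs" % (time.time() - start))

start = time.time()
cpuHDBSCAN(min_samples=25, min_cluster_size=5, core_dist_n_jobs=-1).fit(X)
print("CPU HDBSCAN: %.2fs" % (time.time() - start))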
Figure 4: GPU-accelerated HDBSCAN maintains near-interactive performance even for large datasets, without being limited by CPU parallelism. With these speeds, you'll find you have more time for other things, like better understanding your data.
Benchmarks were performed on a DGX-1 containing a 40-core Intel Xeon CPU and NVIDIA 32GB V100 GPUs. Even with scaling that is linear in the number of dimensions and quadratic in the number of rows, we observe that the GPU still maintains close to interactive performance, even as the number of rows exceeds 1M.
What's next?
While we have successfully implemented the core of the HDBSCAN algorithm on the GPU, there are still opportunities to improve its performance further, such as by speeding up the brute-force kNN graph construction by pruning distance computations, or even by using an approximate kNN. While Euclidean distance covers the broadest range of uses, we would also like to expose the other distance metrics provided in the scikit-learn-contrib implementation.
The scikit-learn-contrib implementation also contains many nice additional features not covered in the seminal paper on HDBSCAN, such as semi-supervised and fuzzy clustering. We also now have robust single-linkage clustering and the building blocks for the OPTICS algorithm, which would make good future additions to RAPIDS. Finally, we hope to support sparse inputs in the future.
If you find that one or more of these features would make your application or data analysis project more successful, even if it is not listed here, please head over to our GitHub project and create an issue.
Summary
HDBSCAN is a relatively new density-based clustering algorithm that "stands on the shoulders of giants", improving upon the well-known DBSCAN and OPTICS algorithms. In fact, its core primitives also increase reuse and provide building blocks for other algorithms, such as the graph-based minimum spanning tree and single-linkage clustering in RAPIDS cuML and cuGraph.
Like other data modeling algorithms, HDBSCAN is not the perfect tool for all jobs, but it has many practical uses in industrial and scientific computing applications. It can also be used with dimensionality reduction algorithms such as PCA or UMAP, especially in exploratory data analysis applications.
About the author
Corey Nolet is a data scientist and senior engineer on the RAPIDS ML team at NVIDIA, where he focuses on building and scaling machine learning algorithms to support extreme data loads at light speed. Prior to working at NVIDIA, Corey spent more than a decade building large-scale exploratory data science and real-time analytics platforms for HPC environments in the defense industry. Corey holds a Bachelor of Science degree in Computer Science. He is still working on his Ph.D. in the same discipline, studying algorithmic acceleration at the intersection of graphs and machine learning. Corey is passionate about using data to better understand the world.
Divye Gala is a senior software engineer on the RAPIDS team at NVIDIA, developing fast and scalable algorithms for machine learning and graph analytics on GPUs. Divye holds M.S. and B.S. degrees in Computer Science. In his free time, Divye enjoys playing football and watching cricket.
Reviewing Editor: Guo Ting