Fast Multicore HDBSCAN

The fast_hdbscan library provides an implementation of the HDBSCAN clustering algorithm designed specifically for high performance on multicore machines. The algorithm runs in parallel and can make effective use of as many cores as you wish to throw at a problem. It is thus ideal for large SMP systems, and even modern multicore laptops.

This library provides a re-implementation of a subset of the HDBSCAN algorithm that is compatible with the hdbscan library. There are specific optimizations for data that is Euclidean and low dimensional. Other distance metrics and high-dimensional data fall back to alternative parallel approaches that are faster than the hdbscan library, but not necessarily as performant as the highly optimized low-dimensional Euclidean case. The primary advantages of this library over the standard hdbscan library are:

this library can easily use all available cores to speed up computation;

this library has much faster implementations of tree condensing and cluster extraction;

this library is much simpler and more approachable, both to extend and to reuse components from;

this library is built on numba and has fewer issues with binaries and compilation;

this library provides features such as semi-supervision, linking constraints, sample weights, and branch detection from FLASC, and an implementation of PLSCAN.
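Because the library is built on numba, the number of cores it uses can presumably be capped through numba's own thread settings. The sketch below sets numba's NUMBA_NUM_THREADS environment variable, which numba reads when it is first imported. The helper name cap_numba_threads is hypothetical, and the assumption that fast_hdbscan's parallel routines follow this setting is not confirmed by this document.

```python
import os


def cap_numba_threads(requested):
    """Hypothetical helper: cap numba's thread pool via NUMBA_NUM_THREADS.

    Call this before numba (and therefore fast_hdbscan) is first imported,
    because numba reads the variable at import time.
    """
    available = os.cpu_count() or 1
    # Clamp the request to the range [1, number of available cores].
    n_threads = max(1, min(int(requested), available))
    os.environ["NUMBA_NUM_THREADS"] = str(n_threads)
    return n_threads


# Assumption: fast_hdbscan's parallel code respects numba's thread count.
n_threads = cap_numba_threads(4)
# import fast_hdbscan  # import only after the variable has been set
```

Leaving the variable unset keeps numba's default, which is to use all available cores.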

This library does not support all the features and input formats available in the hdbscan library, but covers the most common use cases.

This library does support a number of research extensions to HDBSCAN including branch detection from FLASC and the semi-supervised clustering methods, as well as support for sample weights.

As a bonus, this library also provides an easy-to-use implementation of the PLSCAN algorithm for automated cluster resolution selection and layered clustering.

Basic Usage

The fast_hdbscan library follows the hdbscan library in using the sklearn API. You can use the fast_hdbscan class HDBSCAN exactly as you would that of the hdbscan library, with the caveat that fast_hdbscan only supports a subset of the parameters and options of hdbscan. Nonetheless, if you have low-dimensional Euclidean data (e.g. the output of UMAP), you can use this library as a straightforward drop-in replacement for hdbscan:

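import fast_hdbscan
from sklearn.datasets import make_blobs

# Generate 1000 two-dimensional points drawn from Gaussian blobs
data, _ = make_blobs(1000)

# Cluster the points, requiring at least 10 points per cluster
clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)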
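The labels returned by fit_predict can be summarized with a few lines of standard-library Python. The sketch below assumes the hdbscan convention that noise points receive the label -1. The helper summarize_labels is hypothetical and not part of fast_hdbscan, and the label list is made up for illustration.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Hypothetical helper: count clusters and noise points in a label array."""
    counts = Counter(int(label) for label in labels)
    # Assumption: noise points carry the label -1, as in the hdbscan library.
    n_noise = counts.pop(noise_label, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Made-up labels standing in for the output of fit_predict.
example_labels = [0, 0, 1, -1, 1, 1, 2, -1]
summary = summarize_labels(example_labels)
```

The same helper should accept the cluster_labels array from the example above, since it only iterates over the labels and converts each one to a plain integer.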

Installation

fast_hdbscan requires:

numba

numpy

scikit-learn

If you need more than just Euclidean distance, or support for high-dimensional data, you will also need:

pynndescent

fast_hdbscan can be installed via pip:

pip install fast_hdbscan
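To see which of the dependencies listed above are importable in the current environment, something like the following standard-library check may help. The helper missing_packages and the two dictionaries are hypothetical names introduced for illustration. The import name sklearn for scikit-learn is the usual one, and the other packages are assumed to be importable under their distribution names.

```python
from importlib.util import find_spec

# Distribution name -> import name (scikit-learn is imported as sklearn).
REQUIRED = {"numba": "numba", "numpy": "numpy", "scikit-learn": "sklearn"}
OPTIONAL = {"pynndescent": "pynndescent"}


def missing_packages(packages):
    """Hypothetical helper: list the packages whose modules cannot be found."""
    return [name for name, module in packages.items() if find_spec(module) is None]


missing_required = missing_packages(REQUIRED)
missing_optional = missing_packages(OPTIONAL)

if missing_required:
    print("Missing required packages:", ", ".join(missing_required))
if missing_optional:
    print("Missing optional packages:", ", ".join(missing_optional))
```

A missing pynndescent only matters for non-Euclidean metrics or high-dimensional data, as noted above.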

References

The algorithm used here is an adaptation of the algorithms described in the papers:

L. McInnes, J. Healy. Accelerated Hierarchical Density Based Clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42, 2017.

R. Campello, D. Moulavi, J. Sander. Density-Based Clustering Based on Hierarchical Density Estimates. In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172, 2013.

The branch-detection functionality is adapted from:

D.M. Bot, J. Peeters, J. Liesenborgs, J. Aerts. FLASC: a flare-sensitive clustering algorithm. In: PeerJ Computer Science, Volume 11, e2792, 2025. https://doi.org/10.7717/peerj-cs.2792.

The PLSCAN functionality is adapted from:

D.M. Bot, L. McInnes, J. Aerts. Persistent Multiscale Density-based Clustering. In: arXiv preprint arXiv:2512.16558, 2025. https://arxiv.org/abs/2512.16558.
