Basic Usage
You can use fast_hdbscan
as a drop in replacement for hdbscan
for any basic usage of hdbscan
on data with Euclidean distances. That is, fast_hdbscan
uses the exact same interface as hdbscan
and a subset of the parameters – so whenever you are using hdbscan
with the only parameters that fast_hdbscan
supports then you can simply swap them. Let’s load libraries and grab some data to demonstrate:
[1]:
import numpy as np
import fast_hdbscan
import requests
from io import BytesIO
import seaborn as sns
sns.set(rc={"figure.figsize":(8,8)})
data_request = requests.get(
"https://github.com/scikit-learn-contrib/hdbscan/blob/master/notebooks/clusterable_data.npy?raw=true"
)
data = np.load(BytesIO(data_request.content))
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Before we do any clustering lets plot the data so we can see what we are trying to cluster:
[2]:
sns.scatterplot(x=data.T[0], y=data.T[1], alpha=0.5)
[2]:
<AxesSubplot: >
To use fast_hdbscan
you can apply it exactly as you would hdbscan
. In this case, following the approach taken by hdbscan
in its documentation, we’ll use a min_cluster_size
of 15 and call fit_predict
on the data we wish to cluster.
[3]:
cluster_labels = fast_hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(data)
We can then plot the results to see what we are, in fact, performing a standard HDBSCAN clustering – just with full multicore support.
[4]:
sns.scatterplot(
x=data.T[0],
y=data.T[1],
alpha=0.5,
hue=cluster_labels,
style=cluster_labels<0,
size=cluster_labels<0,
palette="tab10",
legend=False
)
[4]:
<AxesSubplot: >
And that’s all there is to it in terms of getting started with fast_hdbscan
. Do note, however, that if you want to use hdbscan
features such as support for precomputed distances matrics, metrics other than Euclidean, prediction or soft clustering etc. then you’ll need to fall back to the hdbscan
library. The goal of this library is to run on the common case (large volumes of low dimensional Euclidean data) as fast as possible with full multicore support.