I have used HDBSCAN in a variety of instances in my data science career. The noise-aware aspect is definitely a mixed blessing. Often I find the best results come from using a variety of clustering algorithms, and figuring out how to do an ensemble of the results (e.g. treating the output of each clustering algorithm as a dimension in a similarity vector). Did you experiment with other clustering algorithms also?
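One way to read the "similarity vector" idea is as a co-association matrix: each point gets one label per algorithm, and two points count as similar in proportion to how many algorithms put them in the same cluster. The sketch below is an assumption about what such an ensemble could look like, not a method described in this thread. The function names, the label arrays, and the 0.5 threshold are all hypothetical, and HDBSCAN-style noise points (label -1) are treated as agreeing with nothing.

```python
from itertools import combinations

NOISE = -1  # label that HDBSCAN-style algorithms give to unclustered points


def co_association(labelings):
    """Fraction of labelings that place each pair of points in the same cluster."""
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("every labeling must cover the same points")
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        # Two noise points are not counted as sharing a cluster.
        agree = sum(
            1 for labels in labelings
            if labels[i] == labels[j] and labels[i] != NOISE
        )
        sim[i][j] = sim[j][i] = agree / len(labelings)
    return sim


def consensus_labels(sim, threshold=0.5):
    """Link points whose co-association exceeds the threshold (connected components)."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if sim[i][j] > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical labels for six points from three algorithms.
kmeans_labels = [0, 0, 0, 1, 1, 1]
agglo_labels = [1, 1, 0, 0, 0, 0]
hdbscan_labels = [0, 0, 0, 1, 1, NOISE]

sim = co_association([kmeans_labels, agglo_labels, hdbscan_labels])
consensus = consensus_labels(sim, threshold=0.5)
print(consensus)
```

In this toy case the point that HDBSCAN marks as noise still joins a consensus cluster because the other two algorithms agree on it, which is one way an ensemble can soften the noise-aware behaviour. In practice the matrix could also be passed to a clustering algorithm that accepts precomputed distances (using 1 minus the similarity) instead of thresholding it.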
Additionally, UMAP is outdated; please use PaCMAP instead: https://www.lesswrong.com/posts/C8LZ3DW697xcpPaqC/the-geometry-of-feelings-and-nonsense-in-large-language?commentId=Deddnyr7zJMwmNLBS
Hey, thanks for the reply. Yes, we tried k-means and agglomerative clustering, and both gave mixed results.
We’ll try PaCMAP instead and see if it is better!