The popular, well-known similarity/distance metrics and clustering algorithms are not nearly as good as the best ones. I think it’d be interesting to see what the results look like with some better, newer, lesser-known metrics.
Examples:
PaCMAP: a better UMAP
DIEM: a better cosine similarity (video explanation)
cosine similarity with cut initialization: a better cosine similarity
Technique for Order Performance by Similarity to Ideal Solution (TOPSIS): another better cosine similarity
TS-SS Similarity: yet another better cosine similarity
Vector Space Model
Fusion-based semantic similarity
Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization
Comparing in context: Improving cosine similarity measures with a metric tensor
Improved sqrt-cosine similarity measurement
Improved Heterogeneous Distance Functions
Encyclopedia of Distances—in case you just can’t get enough, and want a whole book of distance measures!
I don’t actually know if any of these would perform better, or how they rank relative to each other for this purpose. Just wanted to give some starting points.
If you want to search for ‘a better version of technique X’, here’s a list of older techniques to start from: https://rapidfork.medium.com/various-similarity-metrics-for-vector-data-and-language-embeddings-23a745f7f5a7
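Since I haven't tested any of these, here is a rough sketch of how one might check whether swapping the metric changes anything at all: rank the same items against a query under plain cosine similarity and under one alternative, then see how much the top-k neighbours overlap. The alternative used here is the improved sqrt-cosine formula as I understand it from the paper linked above, sum(sqrt(x_i * y_i)) / (sqrt(sum x) * sqrt(sum y)), which only makes sense for non-negative vectors such as term frequencies; please check it against the paper before relying on it. The function names, the random toy data, and the choice of k are all made up for illustration.

```python
import numpy as np


def cosine_similarity(x, y):
    """Plain cosine similarity: dot product divided by the product of L2 norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def improved_sqrt_cosine(x, y):
    """Improved sqrt-cosine similarity, as I read the formula (an assumption):
    sum(sqrt(x_i * y_i)) / (sqrt(sum(x)) * sqrt(sum(y))).
    Only defined for non-negative, non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(x < 0) or np.any(y < 0):
        raise ValueError("improved sqrt-cosine needs non-negative vectors")
    return float(np.sum(np.sqrt(x * y)) / (np.sqrt(x.sum()) * np.sqrt(y.sum())))


def neighbor_ranking(query, items, metric):
    """Indices of items sorted from most to least similar to the query."""
    scores = np.array([metric(query, item) for item in items])
    return np.argsort(-scores, kind="stable")


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of the top-k indices shared by two rankings (1.0 = same set)."""
    top_a = set(np.asarray(rank_a)[:k].tolist())
    top_b = set(np.asarray(rank_b)[:k].tolist())
    return len(top_a & top_b) / k


if __name__ == "__main__":
    # Toy non-negative data standing in for real feature vectors (hypothetical).
    rng = np.random.default_rng(0)
    items = rng.random((50, 8))
    query = rng.random(8)

    by_cosine = neighbor_ranking(query, items, cosine_similarity)
    by_sqrt_cosine = neighbor_ranking(query, items, improved_sqrt_cosine)
    print("top-10 overlap:", top_k_overlap(by_cosine, by_sqrt_cosine, k=10))
```

A high overlap would suggest the alternative metric won't change the downstream results much on your data; a low overlap only tells you the metrics disagree, not which one is better, so you'd still need some labelled or hand-checked examples to judge that.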