Cool work!
Can I ask a couple of questions about the DR+clustering approach?
If I understand correctly, you do the clustering in a 2D space obtained with UMAP (ignore this if I am wrong). Are you sure you are not losing important information with such a low dimension? I ask because you show that one dimension is strongly correlated with style (academic vs forum/blog) and the second may be somewhat correlated with time. I remember there is an argument for using n-1 dimensions when looking for n clusters, although that argument probably assumed linear DR techniques and might not apply to UMAP. Either way, it would be interesting to check whether using a higher n_components (3 to 5) reproduces the same clustering or generates some new insight.
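For example, something along these lines would make the check concrete (a minimal sketch assuming umap-learn and scikit-learn; the data, cluster count, and seeds are all placeholders, not your actual setup):

```python
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(1000, 768)  # stand-in for the real document embeddings

# Cluster on UMAP embeddings of increasing dimension.
labels = {}
for d in (2, 3, 4, 5):
    emb = umap.UMAP(n_components=d, random_state=0).fit_transform(X)
    labels[d] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb)

# How stable is the 2D clustering as the embedding dimension grows?
for d in (3, 4, 5):
    print(f"ARI(2D vs {d}D):", adjusted_rand_score(labels[2], labels[d]))
```

If the adjusted Rand index stays high across dimensions, the 2D clustering is probably not throwing much away; if it drops, the extra dimensions are carrying cluster-relevant structure.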
Another thing you could check is using GMM instead of k-means. My (limited) experience is that if the embedding dimension is low you get better results this way. But, again, I was clustering downstream of linear DR.
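A minimal sketch of that swap, again with placeholder data and cluster count (one nice side effect is that GaussianMixture also gives soft membership probabilities, which k-means doesn't):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

emb_2d = np.random.rand(1000, 2)  # stand-in for the low-dimensional embedding

gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
hard_labels = gmm.fit_predict(emb_2d)   # hard assignments, directly comparable to k-means
soft_probs = gmm.predict_proba(emb_2d)  # per-point cluster membership probabilities
```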
Thank you for the comment and the questions! :)
This was not clear from how we wrote the paper, but we actually do the clustering in the full 768-dimensional space! If you look closely at the clustering plot you can see that the clusters slightly overlap; that would be impossible with k-means in 2D, since in that setting membership is determined by distance from the nearest 2D centroid.
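In sketch form, the pipeline is roughly the following (illustrative placeholder data and cluster count, not our exact code):

```python
import matplotlib.pyplot as plt
import numpy as np
import umap
from sklearn.cluster import KMeans

X = np.random.rand(1000, 768)  # stand-in for the real 768-d document embeddings

# Cluster in the full 768-dimensional space...
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# ...and use UMAP only to produce a 2D view for plotting.
emb_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Clusters can overlap in this 2D view, because membership was decided in 768-d.
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels, s=5)
plt.show()
```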
Ahh, sorry! Going back to reread it, it was actually pretty clear from the text. I was tricked by the figure, where the embedding is presented first. Again, good job! :)