I’m not sure what you mean by “K-means clustering baseline (with K=1)”. I would think the K in K-means stands for the number of means you use, so with K=1, you’re just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance.
But anyway, under my current model (roughly “Why I’m bearish on mechanistic interpretability: the shards are not in the network” + “Binary encoding as a simple explicit construction for superposition”), it seems about as natural to use K-means as it does to use SAEs, and it’s not necessarily an issue if K-means outperforms SAEs. If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space, then K-means seems like a perfectly cromulent quantization for identifying these volumes. The major issue is where we go from here.
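To make the two points above concrete — that K=1 K-means just recovers the global mean (explaining essentially none of the variance), while K-means with enough clusters is a perfectly reasonable quantization baseline — here is a minimal pure-Python sketch on toy 2D “activations”. The four “concept” directions, the noise level, and all numbers are made up for illustration; this is not the actual experiment from the post.

```python
import random

random.seed(0)

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(ps):
    n = len(ps)
    return [sum(c) / n for c in zip(*ps)]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: returns (centroids, assignments)."""
    centroids = [list(p) for p in random.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep old centroid if cluster went empty
                centroids[j] = mean(members)
    return centroids, assign

def kmeans_best(points, k, restarts=10):
    """Best of several random restarts (by residual sum of squares)."""
    best = None
    for _ in range(restarts):
        cents, assign = kmeans(points, k)
        resid = sum(dist2(p, cents[a]) for p, a in zip(points, assign))
        if best is None or resid < best[0]:
            best = (resid, cents, assign)
    return best[1], best[2]

def frac_variance_explained(points, centroids, assign):
    """1 - (residual SS after replacing each point by its centroid) / (total SS)."""
    mu = mean(points)
    total = sum(dist2(p, mu) for p in points)
    resid = sum(dist2(p, centroids[a]) for p, a in zip(points, assign))
    return 1 - resid / total

# Toy "activations": noisy samples around 4 hypothetical concept directions.
concepts = [(3, 0), (0, 3), (-3, 0), (0, -3)]
points = [(cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3))
          for cx, cy in concepts for _ in range(50)]

c1, a1 = kmeans_best(points, 1)  # K=1: the centroid is just the global mean
c4, a4 = kmeans_best(points, 4)  # K matched to the number of concepts

fve1 = frac_variance_explained(points, c1, a1)
fve4 = frac_variance_explained(points, c4, a4)
print(fve1)  # 0.0 — the K=1 centroid IS the mean, so nothing is explained
print(fve4)  # substantially higher once K matches the cluster structure
```

The point of the sketch is only the contrast: quantizing to regions (many centroids) can explain most of the variance even though each point uses a single “latent”, whereas K=1 degenerates to the mean.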
I think this is what I care about finding out. If you’re right, this is indeed neither surprising nor an issue, but your being right would be a major departure from the current mainstream interpretability paradigm(?).
The question of regions vs. compositionality is what I’ve been investigating with my mentees recently, and I’m pretty keen on it. I’ll want to write up my current thoughts on this topic sometime soon.
Thanks for pointing this out! I confused the nomenclature, will fix!
Edit: Fixed now. I confused
- the number of clusters (“K”) / dictionary size, and
- the number of latents (“L_0”, or k in top-k SAEs).
Some clustering methods allow you to assign multiple clusters to one point, so effectively you get an “L_0 > 1”, but normal K-means assigns only one cluster per point. I confused the K of K-means with the k (aka L_0) of top-k SAEs.
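The distinction above can be sketched in a few lines: “K” is the dictionary size (number of clusters / SAE latents), while L_0 (the k in top-k SAEs) is how many latents are active per point. The toy dictionary and point below are made up, and the top-k encoder is a bare sparsity-pattern sketch (dot products only, ignoring encoder bias, ReLU, and decoder reconstruction).

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans_encode(point, centroids):
    """K-means assignment: a one-hot code over K centroids, so L_0 is always 1."""
    j = min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))
    code = [0.0] * len(centroids)
    code[j] = 1.0
    return code

def topk_encode(point, dictionary, k):
    """Top-k SAE-style code: keep the k largest activations, so L_0 = k."""
    acts = [dot(point, d) for d in dictionary]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return [acts[i] if i in top else 0.0 for i in range(len(acts))]

dictionary = [(1, 0), (0, 1), (0.7, 0.7), (-1, 0)]  # K = 4 latents
point = (2.0, 1.0)

km = kmeans_encode(point, dictionary)
tk = topk_encode(point, dictionary, k=2)
print(sum(1 for v in km if v != 0))  # 1 -> K-means: L_0 = 1, whatever K is
print(sum(1 for v in tk if v != 0))  # 2 -> top-k: L_0 = k
```

So comparing K-means against a top-k SAE at matched dictionary size K is comparing an L_0 = 1 code against an L_0 = k code, which is exactly the confusion being untangled here.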
I think he messed up the lingo a bit, but looking at the code, he seems to have run k-means with a number of clusters similar to the number of SAE latents, which seems fine.