I have difficulty following all of these metrics without being able to relate them to the “concepts” being represented and measured. You say:
What I take from this plot is that the gap has pretty high variance. It is not the case that every SAE substitution is kind-of-bad, but rather there are both many SAE reconstructions that are around the expectation and many reconstructions that are very bad.
But it is hard to judge what counts as high variance, and whether the bad reconstructions are bad because of systematic error, insufficient stability of the model, or something else.
The only thing that helps me get an intuition about the concepts is the table with the top 20 tokens by average KL gap. These tokens seem rare? I think it is plausible that the model doesn’t “know” much about them and that might lead to the larger errors? It’s hard to say without more information what tokens representing what concepts are affected.
This was also my hypothesis when I first looked at the table. However, I think this is mostly an illusion. The sample means for rare tokens have very high standard errors, so rare tokens will dominate both extremes of the ranking: they will mostly be the ones with unusually high average KL gap and also the ones with unusually negative average KL gap. And indeed, the correlation between token frequency and KL gap is approximately 0.
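To make the standard-error point concrete, here is a toy simulation. It is not the analysis behind the table, and every number in it is an assumption chosen for illustration: a Zipf-like count per token, the same true mean gap (`true_mean`) for every token, and Gaussian noise with a common per-sample standard deviation (`noise_sd`). Even though no token is actually "worse" than any other in this setup, the tokens at the top and bottom of the ranking by average gap come out rare, while the correlation between frequency and average gap stays near 0.

```python
import numpy as np

def simulate_token_means(n_tokens=2000, true_mean=0.1, noise_sd=1.0, seed=0):
    """Toy model (all parameters are illustrative assumptions).

    Every token has the SAME true mean gap; only the sample size differs.
    """
    rng = np.random.default_rng(seed)
    # Zipf-like counts: a few very frequent tokens and a long tail of rare ones.
    ranks = np.arange(1, n_tokens + 1)
    counts = np.maximum(1, (20000 / ranks).astype(int))
    # The mean of n i.i.d. Normal(mu, sd) draws is Normal(mu, sd / sqrt(n)),
    # so we can sample each token's average gap directly.
    means = rng.normal(true_mean, noise_sd / np.sqrt(counts))
    return counts, means

counts, means = simulate_token_means()

order = np.argsort(means)
bottom20, top20 = order[:20], order[-20:]

median_count = np.median(counts)
top_median = np.median(counts[top20])
bottom_median = np.median(counts[bottom20])
corr = np.corrcoef(counts, means)[0, 1]

print("median count, all tokens:       ", median_count)
print("median count, top 20 by mean:   ", top_median)
print("median count, bottom 20 by mean:", bottom_median)
print("corr(frequency, mean gap):      ", corr)
```

If rare tokens were genuinely harder to reconstruct, one would expect them to crowd only the top of the ranking and the correlation with frequency to be clearly negative. Showing up at both extremes with a near-zero correlation is what pure sampling noise looks like.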