Nice post, would be great to understand what’s going on here!
Minor comment unrelated to your main points:
Conceptually, loss recovered seems a worse metric than KL divergence. Faithful reconstructions should preserve all token probabilities, but loss only compares the probabilities for the true next token.
I don’t think it’s clear we want SAEs to be that faithful, for similar reasons as briefly mentioned here and in the comments of that post. The question is whether differences in the distribution are “interesting behavior” that we want to explain or whether we should think of them as basically random noise that we’re better off ignoring. If the unperturbed model assigns substantially higher probability to the correct token than after an SAE reconstruction, then it’s a good guess that this is “interesting behavior”. But if there are just differences on other random tokens, that seems less clear. That said, I’m kind of torn on this and do agree we might want to explain cases where the model is confidently wrong, and the SAE reconstruction significantly changes the way it’s wrong.
Yes, this is a good consideration. I think KL as a metric makes a good tradeoff here by mostly ignoring changes to tokens the original model treated as low probability (as opposed to measuring something more cursed like log prob L2 distance), and so I think it captures the more interesting differences.
This motivates having good baselines to determine what this noise floor should be.
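To make the contrast between KL, loss on the true token, and log prob L2 distance concrete, here is a toy sketch. The three-token distributions are made up for illustration and are not from any real model or SAE, and the function names are hypothetical. Reconstruction A only perturbs a token the original model treated as very unlikely; reconstruction B moves mass away from the token the original model favored.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats, where p is the original model's distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_prob_l2(p, q):
    """L2 distance between log-probability vectors (every token weighted equally)."""
    return math.sqrt(sum((math.log(pi) - math.log(qi)) ** 2 for pi, qi in zip(p, q)))

def loss_increase(p, q, true_idx):
    """Increase in next-token cross-entropy on the true token only."""
    return math.log(p[true_idx]) - math.log(q[true_idx])

# Assumed toy distributions over a 3-token vocabulary; token 0 is the true next token.
original = [0.90, 0.0999, 0.0001]
recon_a = [0.90, 0.0990, 0.0010]   # 10x change, but only on a very low-probability token
recon_b = [0.80, 0.1999, 0.0001]   # mass shifted away from the high-probability token

kl_a, kl_b = kl_divergence(original, recon_a), kl_divergence(original, recon_b)
l2_a, l2_b = log_prob_l2(original, recon_a), log_prob_l2(original, recon_b)
loss_a, loss_b = loss_increase(original, recon_a, 0), loss_increase(original, recon_b, 0)

print(f"A: KL={kl_a:.5f}  logprob-L2={l2_a:.3f}  loss increase={loss_a:.4f}")
print(f"B: KL={kl_b:.5f}  logprob-L2={l2_b:.3f}  loss increase={loss_b:.4f}")
```

In this toy case KL ranks B as the larger change, while log prob L2 distance ranks A as larger because the 10x change on the rare token dominates. The loss on the true token does not register A at all. How much of a difference like A counts as noise is exactly what a baseline would have to settle.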