I think these aren’t equivalent? KL divergence between the original model’s outputs and the outputs of the patched model is different from reconstruction loss. Reconstruction loss is the CE loss of the patched model. And CE loss is essentially the KL divergence between the prediction and the correct next token, as opposed to between the prediction and the probability distribution of the original model.
Also reconstruction loss/score is in my experience the more standard metric here, though both can say something useful.
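To make the distinction concrete, here’s a minimal PyTorch sketch for a single token position; the logits are made-up toy values, purely illustrative:

```python
import torch
import torch.nn.functional as F

# Toy logits for one token position (vocab size 5); purely illustrative.
logits_original = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0])
logits_patched = torch.tensor([1.5, 1.2, 0.3, -0.5, 0.1])
target = torch.tensor(0)  # ground-truth next-token id

# Reconstruction loss: CE of the *patched* model against the actual next token.
ce_patched = F.cross_entropy(logits_patched.unsqueeze(0), target.unsqueeze(0))

# KL divergence between the original model's distribution and the patched one.
log_p_original = F.log_softmax(logits_original, dim=-1)
log_p_patched = F.log_softmax(logits_patched, dim=-1)
kl = F.kl_div(log_p_patched, log_p_original, log_target=True, reduction="sum")

print(f"CE(patched, target)     = {ce_patched.item():.4f}")  # vs. a one-hot target
print(f"KL(original || patched) = {kl.item():.4f}")          # vs. the original model
```

The two quantities compare the patched model against different reference distributions, which is the point of disagreement here.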
> Reconstruction loss is the CE loss of the patched model
If this is accurate then I agree that this is not the same as “the KL Divergence between the normal model and the model when you patch in the reconstructed activations”. But Fengyuan described reconstruction score as:

> measures how replacing activations changes the total loss of the model

which I still claim is equivalent.
Hmm, maybe I’m misunderstanding something, but I think the reason I’m disagreeing is that the losses being compared are taken with respect to a different distribution (the ground-truth actual next token), so I don’t think comparing two comparisons between distributions is equivalent to comparing the two distributions directly.
E.g., I think for these to be the same, something along the lines of

$$D_{KL}(A \| B) - D_{KL}(C \| B) = D_{KL}(A \| C)$$

or

$$\frac{D_{KL}(A \| B)}{D_{KL}(C \| B)} = D_{KL}(A \| C)$$

would need to be true, but I don’t think either of those is. To connect that to this specific case, let $B$ be the data distribution, and $A$ and $C$ the model with and without replaced activations.
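A quick numeric check supports this; here’s a toy counterexample (the distributions are arbitrary made-up values) showing that neither the difference nor the ratio of the two divergences recovers $D_{KL}(A \| C)$:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# B plays the role of the data distribution; A and C are the model's
# output distributions with and without replaced activations (toy values).
B = np.array([0.7, 0.2, 0.1])
A = np.array([0.5, 0.3, 0.2])
C = np.array([0.6, 0.3, 0.1])

print(kl(A, B) - kl(C, B))  # ~0.063
print(kl(A, B) / kl(C, B))  # ~3.16
print(kl(A, C))             # ~0.047 -- matches neither of the above
```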
Reconstruction score
On a separate note that could also be a crux,

> measures how replacing activations changes the total loss of the model

quite underspecifies what “reconstruction score” is. So I’ll give a brief explanation:
let:

- $L_{\text{original}}$ be the CE loss of the model unperturbed on the data distribution
- $L_{\text{reconstructed}}$ be the CE loss of the model when activations are replaced with the reconstructed activations
- $L_{\text{zero}}$ be the CE loss of the model when activations are replaced with the zero vector
then

$$\text{reconstruction score} = \frac{L_{\text{zero}} - L_{\text{reconstructed}}}{L_{\text{zero}} - L_{\text{original}}}$$

so this has the property that when the value is 0 the SAE is as bad as replacement with zeros, and when it’s 1 the SAE is not degrading performance at all.
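As a minimal sketch of how this might be computed in practice (the `run_with_patch`, `model`, `tokens`, and `sae` names below are hypothetical placeholders, not anything defined in this thread):

```python
def reconstruction_score(loss_original: float,
                         loss_reconstructed: float,
                         loss_zero: float) -> float:
    """(L_zero - L_reconstructed) / (L_zero - L_original).

    1.0 means the SAE reconstruction doesn't degrade the model at all;
    0.0 means it is no better than zero-ablating the activations.
    """
    return (loss_zero - loss_reconstructed) / (loss_zero - loss_original)

# Hypothetical usage: assume run_with_patch(model, tokens, patch_fn) returns
# the mean CE loss on `tokens`, with `patch_fn` applied to the chosen layer's
# activations (patch_fn=None meaning no intervention).
# loss_orig  = run_with_patch(model, tokens, patch_fn=None)
# loss_recon = run_with_patch(model, tokens, patch_fn=lambda a: sae(a))
# loss_zero  = run_with_patch(model, tokens, patch_fn=torch.zeros_like)
# print(reconstruction_score(loss_orig, loss_recon, loss_zero))
```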
It’s not clear that normalizing with $L_{\text{zero}}$ makes a ton of sense, but since this is an emerging domain it’s not fully clear what metrics to use, and this one is pretty standard/common. I’d prefer it if bits/nats lost were the norm, but I haven’t ever seen anyone use that.
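For what it’s worth, “bits/nats lost” would presumably just be the raw loss gap; one natural way to define it (my phrasing, not specified above) is

$$\text{nats lost} = L_{\text{reconstructed}} - L_{\text{original}}, \qquad \text{bits lost} = \frac{L_{\text{reconstructed}} - L_{\text{original}}}{\ln 2},$$

since CE loss computed with natural logarithms is measured in nats, and dividing by $\ln 2$ converts nats to bits.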