You can treat Figure 7 as comparing the L0, and Figure 13 as comparing the L2.
Patch loss is different to L2. It’s the KL divergence between the normal model and the model when you patch in the reconstructed activations at some layer.
Oh I see. I’ll have to look into that because I used the AI-safety-foundation’s implementation and they don’t measure the KL divergence. That said, there is a validation metric called reconstruction score that measures how replacing activations changes the total loss of the model, and the scores are pretty similar for the original and normalized.
That’s equivalent to the KL metric. It would be good to include, as I think it’s the most important metric of performance.
I think these aren’t equivalent? KL divergence between the original model’s outputs and the outputs of the patched model is different than reconstruction loss. Reconstruction loss is the CE loss of the patched model. And CE loss is essentially the KL divergence of the prediction with the correct next token, as opposed to with the probability distribution of the original model.
Also reconstruction loss/score is in my experience the more standard metric here, though both can say something useful.
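To make the distinction concrete, here is a minimal sketch of the two quantities being compared, assuming PyTorch-style logits from one clean forward pass and one patched forward pass; the function and tensor names are illustrative and not taken from any particular SAE library.

```python
import torch.nn.functional as F

# Sketch only: assumes original_logits / patched_logits of shape [batch, seq, vocab]
# and next_tokens of shape [batch, seq] are already computed elsewhere.

def reconstruction_ce_loss(patched_logits, next_tokens):
    # "Reconstruction loss": CE of the patched model against the actual next tokens.
    return F.cross_entropy(patched_logits.flatten(0, 1), next_tokens.flatten())

def patch_kl(original_logits, patched_logits):
    # "Patch loss": KL(original model || patched model), averaged over positions.
    # The ground-truth tokens never appear here, which is the difference at issue.
    log_p = F.log_softmax(original_logits, dim=-1)
    log_q = F.log_softmax(patched_logits, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
```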
“Reconstruction loss is the CE loss of the patched model”
If this is accurate then I agree that this is not the same as “the KL divergence between the normal model and the model when you patch in the reconstructed activations”. But Fengyuan described reconstruction score as “measures how replacing activations changes the total loss of the model”, which I still claim is equivalent.
Hmm, maybe I’m misunderstanding something, but I think the reason I’m disagreeing is that the losses being compared are with respect to a different distribution (the ground truth actual next token), so I don’t think comparing two comparisons between two distributions is equivalent to comparing the two distributions directly.
E.g., I think for these to be the same it would need to be the case that something along the lines of
$$D_{KL}(A \| B) - D_{KL}(C \| B) = D_{KL}(A \| C)$$
or
$$D_{KL}(A \| B) \,/\, D_{KL}(C \| B) = D_{KL}(A \| C)$$
were true, but I don’t think either of those is true. To connect that to this specific case, have B be the data distribution, and A and C the model with and without replaced activations.
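As a quick numerical check of this (with made-up three-outcome distributions, not numbers from the post), neither the difference nor the ratio of the two KL terms recovers $D_{KL}(A \| C)$:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for discrete distributions, in nats
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Made-up distributions: B stands in for the data distribution,
# A and C for the model with and without replaced activations.
A, C, B = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]

print(kl(A, B) - kl(C, B))  # ~0.159
print(kl(A, B) / kl(C, B))  # ~7.27
print(kl(A, C))             # ~0.085  -- neither identity holds for these inputs
```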
Reconstruction score
On a separate note that could also be a crux, “measures how replacing activations changes the total loss of the model” quite underspecifies what “reconstruction score” is. So I’ll give a brief explanation:
let:
$L_{\text{original}}$ be the CE loss of the model unperturbed on the data distribution
$L_{\text{reconstructed}}$ be the CE loss of the model when activations are replaced with the reconstructed activations
$L_{\text{zero}}$ be the CE loss of the model when activations are replaced with the zero vector
then
$$\text{reconstruction score} = \frac{L_{\text{zero}} - L_{\text{reconstructed}}}{L_{\text{zero}} - L_{\text{original}}}$$
so this has the property that when the value is 0 the SAE is as bad as replacement with zeros, and when it’s 1 the SAE is not degrading performance at all.
It’s not clear that normalizing with $L_{\text{zero}}$ makes a ton of sense, but since it’s an emerging domain it’s not fully clear what metrics to use, and this one is pretty standard/common. I’d prefer if bits/nats lost were the norm, but I haven’t ever seen someone use that.
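For concreteness, a minimal sketch of the score as defined above, together with the unnormalized “nats/bits lost” variant mentioned as a preference; the loss values below are made up and assumed to be CE in nats.

```python
import math

def reconstruction_score(l_original, l_reconstructed, l_zero):
    # (L_zero - L_reconstructed) / (L_zero - L_original):
    # 1.0 = patching does not degrade performance, 0.0 = as bad as zero-ablation.
    return (l_zero - l_reconstructed) / (l_zero - l_original)

def loss_added(l_original, l_reconstructed):
    # Unnormalized alternative: extra CE loss caused by patching, in nats
    # (divide by ln 2 to report it in bits).
    return l_reconstructed - l_original

# Made-up example values for the three CE losses (in nats):
print(reconstruction_score(3.0, 3.2, 7.0))   # 0.95
print(loss_added(3.0, 3.2) / math.log(2))    # ~0.29 bits lost
```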
Added to Experiments-Performance Validation!
I think just showing $L_{\text{reconstruction}}$ would be better than the reconstruction score metric, because L0 is very noisy.
I don’t think $L_{\text{reconstruction}}$ is very informative here, as it’s highly impacted by the input batch. Both the raw $L_{\text{reconstruction}}$ and $L_{\text{clean}}$ have large variances at different verification steps, and since we mainly care about how good our reconstruction is compared with the original, I think the reconstruction score is good as is. I also don’t follow why the noisiness of L0 leads to showing $L_{\text{reconstruction}}$.