I think these aren’t equivalent? KL divergence between the original model’s outputs and the outputs of the patched model is different from reconstruction loss. Reconstruction loss is the CE loss of the patched model. And CE loss is essentially the KL divergence between the prediction and the correct next token, as opposed to between the prediction and the probability distribution of the original model.
Also reconstruction loss/score is in my experience the more standard metric here, though both can say something useful.
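To make the distinction concrete, here’s a minimal PyTorch sketch for a single token position; the logits are made-up toy values, purely illustrative:

```python
import torch
import torch.nn.functional as F

# Toy logits for one token position (vocab size 5); purely illustrative.
logits_original = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0])
logits_patched = torch.tensor([1.5, 1.2, 0.3, -0.5, 0.1])
target = torch.tensor(0)  # ground-truth next-token id

# Reconstruction loss: CE of the *patched* model against the actual next token.
ce_patched = F.cross_entropy(logits_patched.unsqueeze(0), target.unsqueeze(0))

# KL divergence between the original model's distribution and the patched one.
log_p_original = F.log_softmax(logits_original, dim=-1)
log_p_patched = F.log_softmax(logits_patched, dim=-1)
kl = F.kl_div(log_p_patched, log_p_original, log_target=True, reduction="sum")

print(f"CE(patched, target)     = {ce_patched.item():.4f}")  # vs. a one-hot target
print(f"KL(original || patched) = {kl.item():.4f}")          # vs. the original model
```

The two quantities compare the patched model against different reference distributions, which is the point of disagreement here.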
> Reconstruction loss is the CE loss of the patched model
If this is accurate then I agree that this is not the same as “the KL Divergence between the normal model and the model when you patch in the reconstructed activations”. But Fengyuan described reconstruction score as:

> measures how replacing activations changes the total loss of the model

which I still claim is equivalent.
Hmm, maybe I’m misunderstanding something, but I think the reason I’m disagreeing is that the losses being compared are taken with respect to a different distribution (the ground-truth actual next token), so I don’t think comparing two comparisons between distributions is equivalent to comparing the two distributions directly.
E.g., I think for these to be the same, something along the lines of

$$D_{KL}(A \| B) - D_{KL}(C \| B) = D_{KL}(A \| C)$$

or

$$\frac{D_{KL}(A \| B)}{D_{KL}(C \| B)} = D_{KL}(A \| C)$$

would need to be true, but I don’t think either of those is. To connect that to this specific case, let $B$ be the data distribution, and $A$ and $C$ the model with and without replaced activations.
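A quick numeric check supports this; here’s a toy counterexample (the distributions are arbitrary made-up values) showing that neither the difference nor the ratio of the two divergences recovers $D_{KL}(A \| C)$:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# B plays the role of the data distribution; A and C are the model's
# output distributions with and without replaced activations (toy values).
B = np.array([0.7, 0.2, 0.1])
A = np.array([0.5, 0.3, 0.2])
C = np.array([0.6, 0.3, 0.1])

print(kl(A, B) - kl(C, B))  # ~0.063
print(kl(A, B) / kl(C, B))  # ~3.16
print(kl(A, C))             # ~0.047 -- matches neither of the above
```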
Reconstruction score
On a separate note that could also be a crux,

> measures how replacing activations changes the total loss of the model

quite underspecifies what “reconstruction score” is. So I’ll give a brief explanation:
let:

- $L_{\text{original}}$ be the CE loss of the model unperturbed on the data distribution
- $L_{\text{reconstructed}}$ be the CE loss of the model when activations are replaced with the reconstructed activations
- $L_{\text{zero}}$ be the CE loss of the model when activations are replaced with the zero vector
then

$$\text{reconstruction score} = \frac{L_{\text{zero}} - L_{\text{reconstructed}}}{L_{\text{zero}} - L_{\text{original}}}$$

so this has the property that when the value is 0 the SAE is as bad as replacement with zeros, and when it’s 1 the SAE is not degrading performance at all.
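As a minimal sketch of how this might be computed in practice (the `run_with_patch`, `model`, `tokens`, and `sae` names below are hypothetical placeholders, not anything defined in this thread):

```python
def reconstruction_score(loss_original: float,
                         loss_reconstructed: float,
                         loss_zero: float) -> float:
    """(L_zero - L_reconstructed) / (L_zero - L_original).

    1.0 means the SAE reconstruction doesn't degrade the model at all;
    0.0 means it is no better than zero-ablating the activations.
    """
    return (loss_zero - loss_reconstructed) / (loss_zero - loss_original)

# Hypothetical usage: assume run_with_patch(model, tokens, patch_fn) returns
# the mean CE loss on `tokens`, with `patch_fn` applied to the chosen layer's
# activations (patch_fn=None meaning no intervention).
# loss_orig  = run_with_patch(model, tokens, patch_fn=None)
# loss_recon = run_with_patch(model, tokens, patch_fn=lambda a: sae(a))
# loss_zero  = run_with_patch(model, tokens, patch_fn=torch.zeros_like)
# print(reconstruction_score(loss_orig, loss_recon, loss_zero))
```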
It’s not clear that normalizing with $L_{\text{zero}}$ makes a ton of sense, but since this is an emerging domain it’s not fully clear what metrics to use, and this one is pretty standard/common. I’d prefer it if bits/nats lost were the norm, but I haven’t ever seen anyone use that.
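For what it’s worth, “bits/nats lost” would presumably just be the raw loss gap; one natural way to define it (my phrasing, not specified above) is

$$\text{nats lost} = L_{\text{reconstructed}} - L_{\text{original}}, \qquad \text{bits lost} = \frac{L_{\text{reconstructed}} - L_{\text{original}}}{\ln 2},$$

since CE loss computed with natural logarithms is measured in nats, and dividing by $\ln 2$ converts nats to bits.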