You can treat Figure 7 as comparing the L0, and Figure 13 as comparing the L2.
Patch loss is different to L2. It’s the KL divergence between the normal model and the model when you patch in the reconstructed activations at some layer.
Oh I see. I’ll have to look into that because I used the AI-safety-foundation’s implementation and they don’t measure the KL divergence. That said, there is a validation metric called reconstruction score that measures how replacing activations changes the total loss of the model, and the scores are pretty similar for the original and normalized.
That’s equivalent to the KL metric. It would be good to include, as I think it’s the most important metric of performance.
I think these aren’t equivalent? KL divergence between the original model’s outputs and the outputs of the patched model is different than reconstruction loss. Reconstruction loss is the CE loss of the patched model. And CE loss is essentially the KL divergence of the prediction with the correct next token, as opposed to with the probability distribution of the original model.
Also reconstruction loss/score is in my experience the more standard metric here, though both can say something useful.
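To make the distinction concrete, here is a minimal sketch of the two quantities being compared, assuming PyTorch-style logits from one clean forward pass and one patched forward pass; the function and tensor names are illustrative and not taken from any particular SAE library.

```python
import torch.nn.functional as F

# Sketch only: assumes original_logits / patched_logits of shape [batch, seq, vocab]
# and next_tokens of shape [batch, seq] are already computed elsewhere.

def reconstruction_ce_loss(patched_logits, next_tokens):
    # "Reconstruction loss": CE of the patched model against the actual next tokens.
    return F.cross_entropy(patched_logits.flatten(0, 1), next_tokens.flatten())

def patch_kl(original_logits, patched_logits):
    # "Patch loss": KL(original model || patched model), averaged over positions.
    # The ground-truth tokens never appear here, which is the difference at issue.
    log_p = F.log_softmax(original_logits, dim=-1)
    log_q = F.log_softmax(patched_logits, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
```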
“Reconstruction loss is the CE loss of the patched model”
If this is accurate then I agree that this is not the same as “the KL divergence between the normal model and the model when you patch in the reconstructed activations”. But Fengyuan described reconstruction score as “measures how replacing activations changes the total loss of the model”, which I still claim is equivalent.
Hmm, maybe I’m misunderstanding something, but I think the reason I’m disagreeing is that the losses being compared are with respect to a different distribution (the ground truth actual next token), so I don’t think comparing two comparisons between two distributions is equivalent to comparing the two distributions directly.
E.g., I think for these to be the same it would need to be the case that something along the lines of
$$D_{KL}(A \| B) - D_{KL}(C \| B) = D_{KL}(A \| C)$$
or
$$D_{KL}(A \| B) \,/\, D_{KL}(C \| B) = D_{KL}(A \| C)$$
were true, but I don’t think either of those is true. To connect that to this specific case, have B be the data distribution, and A and C the model with and without replaced activations.
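As a quick numerical check of this (with made-up three-outcome distributions, not numbers from the post), neither the difference nor the ratio of the two KL terms recovers $D_{KL}(A \| C)$:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for discrete distributions, in nats
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Made-up distributions: B stands in for the data distribution,
# A and C for the model with and without replaced activations.
A, C, B = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]

print(kl(A, B) - kl(C, B))  # ~0.159
print(kl(A, B) / kl(C, B))  # ~7.27
print(kl(A, C))             # ~0.085  -- neither identity holds for these inputs
```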
Reconstruction score
On a separate note that could also be a crux, “measures how replacing activations changes the total loss of the model” quite underspecifies what “reconstruction score” is. So I’ll give a brief explanation:
let:
$L_{\text{original}}$ be the CE loss of the model unperturbed on the data distribution
$L_{\text{reconstructed}}$ be the CE loss of the model when activations are replaced with the reconstructed activations
$L_{\text{zero}}$ be the CE loss of the model when activations are replaced with the zero vector
then
$$\text{reconstruction score} = \frac{L_{\text{zero}} - L_{\text{reconstructed}}}{L_{\text{zero}} - L_{\text{original}}}$$
so this has the property that when the value is 0 the SAE is as bad as replacement with zeros, and when it’s 1 the SAE is not degrading performance at all.
It’s not clear that normalizing with $L_{\text{zero}}$ makes a ton of sense, but since it’s an emerging domain it’s not fully clear what metrics to use, and this one is pretty standard/common. I’d prefer if bits/nats lost were the norm, but I haven’t ever seen someone use that.
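For concreteness, a minimal sketch of the score as defined above, together with the unnormalized “nats/bits lost” variant mentioned as a preference; the loss values below are made up and assumed to be CE in nats.

```python
import math

def reconstruction_score(l_original, l_reconstructed, l_zero):
    # (L_zero - L_reconstructed) / (L_zero - L_original):
    # 1.0 = patching does not degrade performance, 0.0 = as bad as zero-ablation.
    return (l_zero - l_reconstructed) / (l_zero - l_original)

def loss_added(l_original, l_reconstructed):
    # Unnormalized alternative: extra CE loss caused by patching, in nats
    # (divide by ln 2 to report it in bits).
    return l_reconstructed - l_original

# Made-up example values for the three CE losses (in nats):
print(reconstruction_score(3.0, 3.2, 7.0))   # 0.95
print(loss_added(3.0, 3.2) / math.log(2))    # ~0.29 bits lost
```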
Added to Experiments-Performance Validation!
I think just showing $L_{\text{reconstruction}}$ would be better than the reconstruction score metric, because L0 is very noisy.
I don’t think $L_{\text{reconstruction}}$ is very informative here, as it’s highly impacted by the input batch. Both the raw $L_{\text{reconstruction}}$ and $L_{\text{clean}}$ have large variances at different verification steps, and since we mainly care about how good our reconstruction is compared with the original, I think the reconstruction score is good as is. I also don’t follow why the noisiness of L0 leads to showing $L_{\text{reconstruction}}$.