There actually is a problem with Pythia-70M-deduped on data that doesn't start at the initial position. Here is non-deduped vs. deduped CE loss over training (note: their CE is similar if you evaluate on text that starts at the first position of the document).
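In case it's useful, here's roughly how I'd check that claim (a minimal sketch, not the exact eval code; the dataset name and the mid-document offset are placeholder choices):

```python
# Compare CE loss of Pythia-70M vs Pythia-70M-deduped on windows taken from the
# start of a document vs. windows sliced from the middle of a document.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

SEQ_LEN = 256  # matches the sequence length used for the SAE activations

def mean_ce(model, tok, texts, start_mid: bool) -> float:
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids[0]
            # Either take tokens from the document start, or slice a window
            # beginning mid-document (offset of SEQ_LEN is an arbitrary choice).
            offset = SEQ_LEN if start_mid else 0
            if len(ids) < offset + SEQ_LEN:
                continue
            window = ids[offset : offset + SEQ_LEN].unsqueeze(0)
            losses.append(model(window, labels=window).loss.item())
    return sum(losses) / len(losses)

# Placeholder data source; any Pile-like text corpus with a "text" field works.
stream = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
texts = [row["text"] for _, row in zip(range(200), stream)]

for name in ["EleutherAI/pythia-70m", "EleutherAI/pythia-70m-deduped"]:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    print(name,
          "doc-start CE:", mean_ce(model, tok, texts, start_mid=False),
          "mid-doc CE:", mean_ce(model, tok, texts, start_mid=True))
```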
We get similarly performing SAEs when training on the non-deduped model (i.e. the cosine similarity & L2 ratio are similar, though of course the CE will differ if the baseline model is different).
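For reference, by cosine similarity and L2 ratio I just mean the standard reconstruction metrics between the original activations and the SAE reconstruction (a minimal sketch; the function name is mine):

```python
import torch

def reconstruction_metrics(acts: torch.Tensor, recon: torch.Tensor):
    """acts, recon: [batch, d_model] original and reconstructed activations."""
    cos_sim = torch.nn.functional.cosine_similarity(acts, recon, dim=-1).mean()
    l2_ratio = (recon.norm(dim=-1) / acts.norm(dim=-1)).mean()
    return cos_sim.item(), l2_ratio.item()
```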
However, I do think the SAEs were trained on the Pile & I evaluated on OWT, which would lead to some CE difference as well. Let me check.
Edit: Also, the sequence length is 256.
Yep, the results are similar when evaluating on the Pile, with lower CE (except at the low-L0 end).
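For context on what that CE eval involves, this is roughly the typical setup: run the model with the SAE reconstruction patched in at the hooked activation and measure CE on sequences from the chosen dataset (Pile or OWT). A hedged sketch, assuming a HuggingFace Pythia model and an `sae` callable that maps activations to reconstructions; the MLP-output hook point and layer index are placeholders, not necessarily the ones used here:

```python
import torch

def ce_with_sae(model, sae, layer_idx: int, input_ids: torch.Tensor) -> float:
    """CE loss with layer `layer_idx`'s MLP output replaced by sae(activation)."""
    def patch(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return sae(output)  # assumes sae handles [batch, seq, d_model] tensors
    handle = model.gpt_neox.layers[layer_idx].mlp.register_forward_hook(patch)
    try:
        with torch.no_grad():
            loss = model(input_ids, labels=input_ids).loss.item()
    finally:
        handle.remove()
    return loss
```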
Thanks for pointing this out! I'll swap the graphs out with their Pile-evaluated ones once it finishes running [Updated: all images are updated except the one comparing the 4 different "lowest features" values].
We could also train SAEs on Pythia-70M (non-deduped), but that would take a couple of days to run & re-evaluate.