I don’t think zero ablation is that great a baseline. We’re mostly using it for continuity with Anthropic’s prior work (and it’s also a bit easier to explain than a mean-ablation baseline, which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers, so anyone can rescale them however they wish.
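For concreteness, here’s a minimal sketch (mine, not from the paper) of how a loss-recovered number is usually computed from those CE losses, with either a zero- or mean-ablation baseline; the function name and the example losses are made up for illustration.

```python
# Sketch: "loss recovered" from CE losses, with a zero- or mean-ablation baseline.
# ce_clean:  CE loss of the unmodified LM
# ce_sae:    CE loss with the SAE reconstruction spliced in at the chosen site
# ce_ablate: CE loss with the site's activations replaced by zeros (zero ablation)
#            or by their dataset mean (mean ablation)

def loss_recovered(ce_clean: float, ce_sae: float, ce_ablate: float) -> float:
    """Fraction of the CE loss gap (ablation vs clean) that the SAE closes."""
    return (ce_ablate - ce_sae) / (ce_ablate - ce_clean)

# Illustrative numbers only: zero ablation typically hurts loss more than mean
# ablation, so the larger denominator makes the "recovered" fraction look better.
print(loss_recovered(ce_clean=3.20, ce_sae=3.35, ce_ablate=5.10))  # ~0.92 (zero ablation)
print(loss_recovered(ce_clean=3.20, ce_sae=3.35, ce_ablate=4.10))  # ~0.83 (mean ablation)
```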
I don’t think compute efficiency hit[1] is ideal. It’s really expensive to compute, since you can’t calculate it from an SAE alone: you also need facts about smaller LLMs. It also doesn’t transfer as well between sites (splicing in an attention-layer SAE doesn’t impact loss much, splicing in an MLP SAE impacts loss more, and residual stream SAEs impact loss the most). Overall I expect it’s a useful but expensive alternative to loss recovered, not a replacement.
EDIT: on consideration of Leo’s reply, I think my point about transfer is wrong; a metric like “compute efficiency recovered” could always be created by rescaling the compute efficiency number.
What I understand “compute efficiency hit” to mean is: for a given (SAE, LM1) pair, how much less compute (as a multiplier) you’d need to train a different LM, LM2, such that LM2 gets the same loss as LM1-with-the-SAE-spliced-in.
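If that reading is right, here’s a rough sketch of how you might estimate it: fit a compute-to-loss curve from a series of smaller LMs and invert it. The scaling-law form, the helper names, and every number below are illustrative assumptions on my part, not anything from the paper.

```python
import numpy as np

# Sketch of "compute efficiency hit" under an assumed scaling-law form
#   loss(C) = a * C**(-b) + c,
# fit to (training compute, final loss) points from smaller LMs in the same series.
# All constants and data points below are made up for illustration.

def fit_scaling_law(compute, loss, c_irreducible):
    """Least-squares fit of log(loss - c) = log(a) - b*log(C); returns (a, b)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(np.array(loss) - c_irreducible), 1)
    return np.exp(intercept), -slope

def compute_for_loss(target_loss, a, b, c_irreducible):
    """Invert loss(C) = a*C**(-b) + c to get the compute needed to reach target_loss."""
    return ((target_loss - c_irreducible) / a) ** (-1.0 / b)

# Hypothetical series of smaller LMs: training compute (FLOPs) and final loss.
compute_pts = [1e18, 1e19, 1e20]
loss_pts = [3.9, 3.5, 3.2]
a, b = fit_scaling_law(compute_pts, loss_pts, c_irreducible=2.0)

lm1_compute = 1e21      # compute used to train LM1 (made up)
loss_with_sae = 3.05    # CE loss of LM1 with the SAE spliced in (made up)

# Compute a smaller LM2 would need to match LM1-with-SAE's loss,
# and the "hit" as the ratio to LM1's own training compute.
lm2_compute = compute_for_loss(loss_with_sae, a, b, c_irreducible=2.0)
print(f"compute efficiency hit ≈ {lm1_compute / lm2_compute:.1f}x")
```

The expensive part is getting the (compute, loss) points in the first place, which is why this depends on having a series of smaller LMs to fit against.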
It doesn’t seem like a huge deal to depend on the existence of smaller LLMs: they’ll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction-based metric, because different parts of the model genuinely carry differently important information.