Another question: any particular reason to expect ablate-to-zero to be the most relevant baseline? In my experiments, I find ablating to zero completely destroys the loss. So it's unclear whether 90% recovered on this metric actually means that much: GPT-2 probably recovers 90% of the loss of GPT-4 under this metric, but obviously GPT-2 only explains a tiny fraction of GPT-4's capabilities. A more natural measure might be, for example, the equivalent compute efficiency hit.
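For concreteness, here's a minimal sketch of the metric being discussed, assuming the usual "loss recovered" fraction (L_ablate − L_spliced) / (L_ablate − L_clean); the helper name and the numbers in the example are purely illustrative, not from the paper:

```python
def loss_recovered(clean_loss: float, spliced_loss: float, ablated_loss: float) -> float:
    """Fraction of CE loss recovered by splicing the SAE into the model,
    relative to an ablation baseline (zero- or mean-ablating the activations).
    1.0 means the reconstruction is as good as the clean activations;
    0.0 means no better than the ablation baseline."""
    return (ablated_loss - spliced_loss) / (ablated_loss - clean_loss)

# Illustrative (made-up) numbers: zero ablation blows the loss up so much
# that a mediocre SAE still scores ~90% "recovered"...
print(loss_recovered(clean_loss=3.3, spliced_loss=4.2, ablated_loss=12.0))  # ~0.90
# ...while against a (much less damaging) mean-ablation baseline
# the same SAE looks much worse.
print(loss_recovered(clean_loss=3.3, spliced_loss=4.2, ablated_loss=5.0))   # ~0.47
```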
Nah, I think it's pretty sketchy. I personally prefer mean ablation, especially for residual stream SAEs, where zero ablation is super damaging. But even there I agree. Compute efficiency hit would be nice, though it's a pain to get the scaling laws precise enough.
For our paper this is irrelevant though, IMO, because we're comparing gated and normal SAEs, and I think this is just scaling by a constant? It's at least monotonic in CE loss degradation.
I don’t think zero ablation is that great a baseline. We’re mostly using it for continuity’s sake with Anthropic’s prior work (and also it’s a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish.
I don’t think compute efficiency hit[1] is ideal. It’s really expensive to compute, since you can’t calculate it from an SAE alone; you need to know facts about smaller LLMs. It also doesn’t transfer as well between sites (splicing in an attention layer SAE doesn’t impact loss much, splicing in an MLP SAE impacts loss more, and residual stream SAEs impact loss the most). Overall I expect it’s a useful but expensive alternative to loss recovered, not a replacement.
EDIT: on consideration of Leo’s reply, I think my point about transfer is wrong; a metric like “compute efficiency recovered” could always be created by rescaling the compute efficiency number.
[1]: What I understand “compute efficiency hit” to mean is: for a given (SAE, LM1) pair, how much less compute you’d need (as a multiplier) to train a different LM, LM2, such that LM2 gets the same loss as LM1-with-the-SAE-spliced-in.
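A rough sketch of how that definition could be operationalized, assuming you have a power-law scaling fit for the model family (all constants and function names below are made up for illustration, not from the paper):

```python
def loss_from_compute(C: float, a: float = 20.0, b: float = 0.1, irreducible: float = 1.7) -> float:
    """Hypothetical scaling-law fit: pretraining loss as a function of training compute C."""
    return a * C ** (-b) + irreducible

def compute_for_loss(L: float, a: float = 20.0, b: float = 0.1, irreducible: float = 1.7) -> float:
    """Invert the fit: compute needed to train a model that reaches loss L."""
    return ((L - irreducible) / a) ** (-1.0 / b)

def compute_efficiency_hit(clean_loss: float, spliced_loss: float) -> float:
    """Multiplier on training compute: how much less compute LM2 would need to
    match LM1-with-the-SAE-spliced-in, relative to matching LM1 itself."""
    return compute_for_loss(spliced_loss) / compute_for_loss(clean_loss)

# Made-up example: splicing the SAE degrades loss from 3.0 to 3.2, which under
# this particular fit corresponds to roughly a 4x compute haircut.
print(compute_efficiency_hit(clean_loss=3.0, spliced_loss=3.2))  # ~0.24
```

The expense mentioned above is that you need a real fit of this kind across a series of smaller models, rather than anything you can compute from the SAE and one LM alone.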
It doesn’t seem like a huge deal to depend on the existence of smaller LLMs: they’ll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction-based metric, because there’s actually just differently important information in different parts of the model.