While we’re on the topic, it’s perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially narrow-distribution understanding of the kind a lot of work building Causal Scrubbing seems to be focusing on.
(Context, I work at Redwood)
Can I summarize your concerns as something like “I’m not sure that looking into the behavior of “real” models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?” Or perhaps you think it’s slightly better, but not considerably?
If so, I mostly agree—it’s not very clear to me that this is much better. I’m something like in favor of:
Picking a distribution
Training a model to perform well on that distribution
Interpreting the model (or parts of the model, etc.)
as a default interpretability workflow. (A rough sketch of what this could look like in code is below.)
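Here is that sketch, purely for illustration: the toy task, the tiny MLP, and the weight-inspection step are all made up, and a real project would presumably train a small transformer and use a more careful interpretation step (e.g. causal interventions rather than just reading off weights).
```python
# Sketch only: pick an arbitrary synthetic distribution, train a tiny model on
# it, then do a crude "interpretation" pass. Everything here is illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1. Pick a distribution: x ~ N(0, I); the label is whether inputs 0 and 3 are
#    both positive (an arbitrary narrow task).
def sample_batch(n, d=8):
    x = torch.randn(n, d)
    y = ((x[:, 0] > 0) & (x[:, 3] > 0)).float()
    return x, y

# 2. Train a model to perform well on that distribution.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):
    x, y = sample_batch(256)
    loss = nn.functional.binary_cross_entropy_with_logits(model(x).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3. Interpret the model (or parts of it): here, just check which input each
#    first-layer unit reads from most strongly; we expect inputs 0 and 3 to
#    dominate.
with torch.no_grad():
    strongest_inputs = model[0].weight.abs().argmax(dim=1)
print("strongest input per hidden unit:", strongest_inputs.tolist())
```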
For instance, it’s not very clear to me that IOI is much more interesting than just training a model on some version of the IOI distribution and then interpreting that model. And I think a key problem with IOI is that the model doesn’t really care very much about doing well on this exact task: after having skimmed through copious amounts[1] of OpenWebText, the IOI task as exactly formulated seems pretty non-central IMO.
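For concreteness, “some version of the IOI distribution” could be generated by something as simple as the following (the names and the single template are placeholders; the original IOI work uses a wider set of templates and names):
```python
# Rough sketch of an IOI-style distribution: templated sentences where the
# correct completion is the indirect object, i.e. the name that appears only
# once. Names and the template are placeholders for illustration.
import random

NAMES = ["Mary", "John", "Alice", "Bob", "Carol", "Dan"]
TEMPLATE = "When {A} and {B} went to the store, {B} gave a drink to"

def sample_ioi_example(rng=random):
    a, b = rng.sample(NAMES, 2)
    prompt = TEMPLATE.format(A=a, B=b)
    target = " " + a  # the non-repeated name is the correct next token
    return prompt, target

print(sample_ioi_example())
```
Training a small model to predict the target name on prompts like these is the kind of “train on the IOI distribution directly” alternative being gestured at here.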
There are various arguments for looking into narrow examples IMO, but the case is a bit more subtle. (For instance, it seems like we should ideally be able to answer questions like ‘why did this model have strange behavior on this narrow distribution?’, where the ‘why’ will probably have to make reference to how the model behaves on a broader distribution of interest.)
It’s also possible we disagree about how useful it is to do interpretability on toy tasks. I’m not really sure if there’s anything interesting and quick to say here.
[1] I’ve perhaps skimmed somewhere between 10,000 and 100,000 passages? (I haven’t counted.)
Can I summarize your concerns as something like “I’m not sure that looking into the behavior of “real” models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?” Or perhaps you think it’s slightly better, but not considerably?
Between the two, I might actually prefer training a toy model on a narrow distribution! But it depends a lot on exactly how the analysis is done and what lessons one wants to draw from it.
Real language models seem to make extensive use of superposition. I expect there to be lots of circuits superimposed with the one you’re studying, and I worry that studying it on a narrow distribution may give a misleading impression – as soon as you move to a broader distribution, overlapping features and circuits which you previously missed may activate, and your understanding may turn out to be wrong.
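As a toy numeric picture of that worry (an illustration of the general phenomenon, not a claim about any particular model): suppose two features share a 2-dimensional space with overlapping directions, and a readout for feature A is validated on a narrow distribution where feature B never fires. The readout looks exact there, but degrades once feature B starts activating on a broader distribution.
```python
# Illustrative only: two features stored in superposition (non-orthogonal
# directions). A readout for feature A that looks perfect on a narrow
# distribution (feature B silent) picks up interference on a broader one.
import numpy as np

rng = np.random.default_rng(0)
d_A = np.array([1.0, 0.0])   # direction carrying feature A
d_B = np.array([0.8, 0.6])   # direction carrying feature B, overlapping with d_A
readout = d_A                # naive interpretation: read A by projecting onto d_A

def acts(a, b):
    # activations are a linear mix of the two feature directions
    return np.outer(a, d_A) + np.outer(b, d_B)

a = rng.random(1000)

# Narrow distribution: feature B is never active, so the readout recovers A exactly.
narrow_err = np.abs(acts(a, np.zeros(1000)) @ readout - a).max()

# Broader distribution: feature B sometimes fires and leaks into the readout.
b = (rng.random(1000) < 0.3) * rng.random(1000)
broad_err = np.abs(acts(a, b) @ readout - a).max()

print(f"max readout error: narrow {narrow_err:.3f}, broad {broad_err:.3f}")
```
The point is just that a story which checks out cleanly on the narrow slice can coexist with substantial interference elsewhere.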
On the other hand, for a model just trained on a toy task, I think your understanding is likely closer to the truth of what’s going on in that model. If you’re studying it over the whole training distribution, features either aren’t in superposition (there’s so much free capacity in most of these models that this seems possible!) or else they’ll be part of the unexplained loss, in your language. So choosing to use a toy model is just a question of what that model teaches you about real models (for example, you’ve kind of side-stepped superposition, and it’s also unclear to what extent the features and circuits in a toy model represent those of a larger model). But it seems much clearer what is true, and it also seems much clearer that these limitations exist.