I’m currently writing up a post about the ROME intervention and its limitations. One point I want to illustrate is that the intervention is a bit more finicky than one might initially think. Still, my hope is that such interventions, even when they don't fully explain what is going on, give us extra confidence in our interpretability results (in this case causal tracing).
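To make the point concrete, here is a minimal sketch of the kind of corrupt-and-restore intervention I have in mind, in the spirit of causal tracing rather than the ROME authors' exact code. The model name, prompt, subject-token span, and noise scale are illustrative assumptions: it noises the subject-token embeddings, then restores one layer's clean hidden state at the final token and checks how much of the answer probability comes back.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any HF causal LM with a GPT-2-style block list
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "The Eiffel Tower is located in the city of"
answer = " Paris"
subject_span = (1, 5)  # assumed token positions of "Eiffel Tower"; check against the actual tokenization

inputs = tok(prompt, return_tensors="pt")
answer_id = tok(answer)["input_ids"][0]
last_pos = inputs["input_ids"].shape[1] - 1

def answer_prob(logits):
    # Probability of the answer's first token at the final position.
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

# 1) Clean run: cache each transformer block's output with forward hooks.
clean_cache = {}
def make_cache_hook(layer):
    def hook(module, inp, out):
        clean_cache[layer] = out[0].detach().clone()
    return hook

handles = [model.transformer.h[l].register_forward_hook(make_cache_hook(l))
           for l in range(model.config.n_layer)]
with torch.no_grad():
    clean = model(**inputs)
for h in handles:
    h.remove()

# 2) Corrupted run: add fixed Gaussian noise to the subject-token embeddings.
torch.manual_seed(0)
noise = 3.0 * torch.randn(1, subject_span[1] - subject_span[0], model.config.n_embd)

def corrupt_embeddings(module, inp, out):
    out = out.clone()
    out[:, subject_span[0]:subject_span[1]] += noise
    return out

def make_restore_hook(layer, pos):
    # Patch the clean hidden state back in at (layer, pos).
    def hook(module, inp, out):
        hidden = out[0].clone()
        hidden[:, pos] = clean_cache[layer][:, pos]
        return (hidden,) + out[1:]
    return hook

corrupt_handle = model.transformer.wte.register_forward_hook(corrupt_embeddings)
with torch.no_grad():
    corrupted = model(**inputs)
print("clean p(answer):    ", answer_prob(clean.logits))
print("corrupted p(answer):", answer_prob(corrupted.logits))

# 3) Restoration runs: with the input still corrupted, restore one (layer, position)
#    to its clean value and see how much of the answer probability comes back.
for layer in range(model.config.n_layer):
    restore_handle = model.transformer.h[layer].register_forward_hook(
        make_restore_hook(layer, last_pos))
    with torch.no_grad():
        restored = model(**inputs)
    restore_handle.remove()
    print(f"layer {layer:2d} restored p(answer): {answer_prob(restored.logits):.4f}")
corrupt_handle.remove()
```

If the recovery is spread across a band of layers rather than showing a single sharp peak, that is exactly the kind of result that should make us hesitant to read "the fact lives in layer L" off the trace.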
If we do these types of interventions, I think we need to be careful not to infer things about the model that aren't there (facts are not highly localized in a single layer).
So, in the context of this post, if we do find things that look like search, I agree that we should make specific statements about internal structures and find ways to validate those statements/hypotheses. However, let's keep in mind that they are likely not exactly what we are modeling them to be (though we can still learn from them).