(I’m just going to speak for myself here, rather than the other authors, because I don’t want to put words in anyone else’s mouth. But many of the ideas I describe in this review are due to other people.)
I think this work was a solid intellectual contribution. I think that the metric it proposes for how much of a behavior you’ve explained is the most reasonable such metric by a pretty large margin.
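(To make the metric concrete: roughly, you perform all the resampling ablations your hypothesis claims shouldn’t matter and then check how much of the behavior survives, often summarized as something like a fraction of loss recovered. The snippet below is my own simplified illustration of that kind of score, not the paper’s exact formulation; the function and parameter names and the numbers are made up.)

```python
def loss_recovered(l_model: float, l_scrubbed: float, l_baseline: float) -> float:
    """Simplified "how much of the behavior did the explanation capture?" score.

    l_model:    expected loss of the unmodified model on the behavior's dataset
    l_scrubbed: expected loss after performing every resampling ablation the
                hypothesis claims should not change the behavior
    l_baseline: expected loss of a maximally-scrubbed / uninformed baseline
    Returns 1.0 if scrubbing doesn't hurt the behavior at all and 0.0 if it
    hurts as much as destroying the relevant computation entirely.
    """
    return (l_baseline - l_scrubbed) / (l_baseline - l_model)


# Made-up numbers: scrubbing costs some, but not all, of the behavior.
print(loss_recovered(l_model=0.40, l_scrubbed=0.85, l_baseline=1.60))  # 0.625
```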
The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I’m glad we did. But these negative results haven’t had that much influence on other people’s work AFAICT, so overall it seems somewhat low impact.
The empirical results in this paper demonstrated that induction heads are not the simple circuit which many people claimed (see this post for a clearer statement of that), and we then used these techniques to get mediocre results for IOI (described in this comment).
There hasn’t been much followup on this work. I suspect that the main reasons people haven’t built on this are:
it’s moderately annoying to implement
it makes your explanations look bad (IMO because they actually are unimpressive), so you aren’t that incentivized to get it working
the interp research community isn’t very focused on validating whether its explanations are faithful, and in any case we didn’t successfully persuade many people that explanations performing poorly according to this metric means they’re importantly unfaithful
I think that interpretability research isn’t going to be able to produce explanations that are very faithful to what’s going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of the faithfulness of explanations don’t seem very important to me now.
(I think that people who want to do research that uses model internals should evaluate their techniques by measuring performance on downstream tasks (e.g. weak-to-strong generalization and measurement tampering detection) instead of trying to use faithfulness metrics.)
I wish we’d never bothered with trying to produce faithful explanations (or researching interpretability at all). But causal scrubbing was important in convincing us to stop working on this, so I’m glad for that.
See the dialogue between Ryan Greenblatt, Neel Nanda, and me for more discussion of all this.
—
Another reflection question: did we really have to invent this whole recursive algorithm? Could we have just done something simpler?
My guess is that no, we couldn’t have done something simpler: the core contribution of CaSc is to give you a single number for the whole explanation, and I don’t see how to get that number without doing something like our approach, where you apply every intervention at the same time.
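(To illustrate what applying every intervention at once buys you, here’s a deliberately tiny toy of my own, not the recursive algorithm from the paper, and all the names in it are made up: the toy model computes the behavior perfectly, but the hypothesis mis-describes one component, and simultaneously resampling everything the hypothesis calls irrelevant surfaces that as a single, worse loss.)

```python
import random

random.seed(0)

# Toy "model": two components whose outputs are summed. Together they compute
# x[0] + x[1] exactly, but comp_a secretly leans on x[1] a little.
def comp_a(x):
    return x[0] + 0.3 * x[1]

def comp_b(x):
    return 0.7 * x[1]

def model(x):
    return comp_a(x) + comp_b(x)

# Behavior: predict x[0] + x[1] on random inputs, scored by squared error.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5000)]

def sq_loss(pred, x):
    return (pred - (x[0] + x[1])) ** 2

# Hypothesis to test: "comp_a only uses x[0]; comp_b only uses x[1]".
# CaSc-flavored check: for every component *simultaneously*, resample the
# inputs the hypothesis claims it ignores from a random other datapoint,
# then run the patched computation once and record the loss.
def scrubbed_model(x, dataset):
    xa = random.choice(dataset)   # donor for comp_a's supposedly-irrelevant input
    xb = random.choice(dataset)   # donor for comp_b's supposedly-irrelevant input
    a = comp_a((x[0], xa[1]))     # keep x[0] (claimed relevant), scrub x[1]
    b = comp_b((xb[0], x[1]))     # keep x[1] (claimed relevant), scrub x[0]
    return a + b

orig = sum(sq_loss(model(x), x) for x in data) / len(data)
scrub = sum(sq_loss(scrubbed_model(x, data), x) for x in data) / len(data)
print(f"original loss {orig:.4f}, scrubbed loss {scrub:.4f}")
# The original loss is ~0 but the scrubbed loss is clearly worse, because the
# hypothesis wrongly claims comp_a ignores x[1]. Every claim in the hypothesis
# gets tested in the same scrubbed forward pass, so the whole explanation is
# summarized by one number.
```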
I agree with the overall point (that this was a solid intellectual contribution and is a reasonable-ish metric), but there have been a non-zero number of follow-ups, or at least use cases, of this work, IMO. Off the top of my head:
In general, CaSc has been used on lots of toy/tiny models with a decent level of success. I agree that part of the reason for CaSc’s lack of adoption is that the metric consistently returns “this explanation is not very faithful/complete/etc”. For example:
I checked the hypotheses for the toy modular arithmetic/group composition work with my own hand-crafted CaSc implementation and found that the modular arithmetic results held up quite well.
CaSc-style tests were used by Marius and Stefan to confirm their solutions to Stephen Casper’s Mech Interp challenges (challenge 1, challenge 2).
etc.
Erik Jenner’s agenda is pretty closely related to causal scrubbing and is still actively being worked on.
Thanks for the links! I agree that the use cases are non-zero.
By “explanations” you mean labeled high-level causal graphs right? Do you also think it’s infeasible to identify sparse, unlabeled circuits as “the part of the model that’s doing the task”, like in ACDC, in a way that gets good performance on some downstream task?
By explanations, I think Buck means fully human understandable explanations.
On whether it’s feasible to identify sparse, unlabeled circuits that get good performance on a downstream task: personally, I don’t have a strong opinion, and this will probably depend on the exact architecture and the extent of sparsity we demand. This seems related to other views I have on difficulties in interp (ETA: so I’m probably more pessimistic here than people who are generally more optimistic about interp), but it’s at least partially orthogonal.