This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models because of the optimizations required, or that they would work better because fewer concepts would be in superposition. Work on this feels quite important, even though there’s a lot more work to be done.
Also, sharing some eye-catching results seems important for building excitement for interpretability research.
Update: I skipped the TLDR when I was reading this post and just read the rest. I guess I’m fine with Anthropic mostly focusing on establishing one kind of robustness and leaving other kinds of robustness for future work. I’d be more likely to agree with Stephen Casper if there isn’t further research from Anthropic in the next year that makes significant progress in evaluating the robustness of their approach. One additional point: independent researchers can run some of these other experiments, but they can’t run the scaling experiment.
Note that scasper said:
"Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, [...]"
I (like scasper) think this work is useful, but I share some of scasper’s concerns.
In particular:
I think prior work like this from the Anthropic interp team has been systematically overrated by others, and the Anthropic interp team could take actions to avoid this.
IMO, buzz on Twitter systematically overrates the results of this paper and their importance.
I’m uncertain, but I think I might prefer less excitement for this style of interp research all else equal.
Heuristically, it seems bad if people systematically overrate a field where one of the core aims is to test for subtle and dangerous failure modes.
I’d be excited for further work focused on developing actually useful MVPs; that seems more important to me than more work like this.
I think the theory of change commonly articulated by various people on the Anthropic interp team (enumerative safety to test for deceptive alignment) probably requires way harder core technology and much more precise results (at least to get more than a bit or two of evidence). Additional rigor, and trying to assess the extent to which you understand things, seem important for this. So, I’d like to see people try this and update faster. (Including myself: I’m not that sure!)
I think other, less ambitious theories of change are more plausible (e.g., this recent work), and seeing how these go seems more informative for what to work on than eyeballing SAEs, IMO.
I don’t think the post is saying the result is not valuable. The claim is that it underperformed expectations. A stock price falls if the company underperforms expectations, even if it is still profitable; that does not mean it made a loss.