Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights.
Like scasper, I think this work is useful, but I also share some of scasper’s concerns.
In particular:
I think prior work like this from the Anthropic interp team has been systematically overrated by others, and the team could take actions to avoid this.
IMO, the buzz on Twitter systematically overrates this paper’s results and their importance.
I’m uncertain, but I think I might prefer less excitement for this style of interp research, all else equal.
Heuristically, it seems bad if people systematically overrate a field where one of the core aims is to test for subtle and dangerous failure modes.
I’d be excited for further work focusing on developing actually useful MVPs, and this seems more important to me than more work like this.
I think the theory of change commonly articulated by various people on the Anthropic interp team (enumerative safety to test for deceptive alignment) probably requires way harder core technology and much more precise results (at least to get more than a bit or two of evidence). Additional rigor, and trying to assess the extent to which you actually understand things, seems important for this. So, I’d like to see people try on this and update faster. (Including myself: I’m not that sure!)
I think other less ambitious theories of change are more plausible (e.g. this recent work), and seeing how these go seems more informative for what to work on than eyeballing SAEs, IMO.