Small nitpick (I agree with almost everything else in the post and am glad you wrote it up). This feels like an unfair criticism. I assume you are referring specifically to this statement in their paper:
Although advocates for AI safety guidelines often allude to the “black box” nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.
I think Anthropic’s interpretability team, while making perhaps dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from ‘solved.’ For instance, Chris Olah in the linked NYT article from today:
“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.
Also, in the paper’s section on Inability to Evaluate:
it’s unclear that they’re really getting at the fundamental thing we care about
I think they are overstating how far along mechanistic interpretability currently is and how useful it is. However, I don’t think this messaging comes anywhere close to ‘mechanistic interpretability solves AI interpretability’; that error is on a16z, not Anthropic.
Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don’t have any disagreements.
Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like “Sparse autoencoders produce interpretable features for large models” contribute to this.
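(For concreteness about what that claim refers to: a sparse autoencoder here is a small model trained to reconstruct a network’s internal activations through a wide, sparsity-penalized bottleneck, and its hidden units are the candidate ‘features’ whose interpretability is being claimed. Below is a minimal illustrative sketch in PyTorch, not Anthropic’s implementation; the dimensions, coefficients, and names are made up.)

```python
# Toy sparse autoencoder of the kind used in interpretability work:
# reconstruct activations through an overcomplete, L1-penalized bottleneck,
# then treat the hidden units as candidate "features."
# (Illustrative only; sizes and coefficients are hypothetical.)
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # pushes most of them to zero for any given input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def loss_fn(x, reconstruction, features, l1_coefficient=1e-3):
    # Reconstruction error plus a sparsity penalty on the feature activations.
    reconstruction_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = l1_coefficient * features.abs().sum(dim=-1).mean()
    return reconstruction_loss + sparsity_loss


# Usage sketch: train on activations collected from the model of interest,
# then inspect which inputs most strongly activate each feature.
sae = SparseAutoencoder()
x = torch.randn(32, 512)  # stand-in batch of activations
recon, feats = sae(x)
loss = loss_fn(x, recon, feats)
loss.backward()
```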
It seems at least somewhat reasonable to ask people to write defensively to guard against their statements being misused by adversarial actors. I recognize this is an annoying ask that may carry significant overhead; perhaps it will turn out not to be worth the cost.
+1, I think the correct conclusion is “a16z are telling bald-faced lies to major governments,” not “a16z were misled by Anthropic hype.”