Small nitpick (I agree with almost everything else in the post and am glad you wrote it up). This feels like an unfair criticism. I assume you are referring specifically to this statement in their paper:
Although advocates for AI safety guidelines often allude to the “black box” nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.
I think Anthropic’s interpretability team, while making perhaps dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from ‘solved.’ For instance, Chris Olah in the linked NYT article from today:
“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.
Also, in the paper’s section on Inability to Evaluate:
it’s unclear that they’re really getting at the fundamental thing we care about
I think they are overstating how far along mechanistic interpretability currently is and how useful it is. However, I don’t think this messaging comes anywhere close to ‘mechanistic interpretability solves AI interpretability’; that error is on a16z, not Anthropic.
Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don’t have any disagreements.
Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like “Sparse autoencoders produce interpretable features for large models” contribute to this.
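(For concreteness about what that claim refers to: a sparse autoencoder here is a small model trained to reconstruct a network’s internal activations through a wide, sparsity-penalized bottleneck, and its hidden units are the candidate ‘features’ whose interpretability is being claimed. Below is a minimal illustrative sketch in PyTorch, not Anthropic’s implementation; the dimensions, coefficients, and names are made up.)

```python
# Toy sparse autoencoder of the kind used in interpretability work:
# reconstruct activations through an overcomplete, L1-penalized bottleneck,
# then treat the hidden units as candidate "features."
# (Illustrative only; sizes and coefficients are hypothetical.)
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # pushes most of them to zero for any given input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def loss_fn(x, reconstruction, features, l1_coefficient=1e-3):
    # Reconstruction error plus a sparsity penalty on the feature activations.
    reconstruction_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = l1_coefficient * features.abs().sum(dim=-1).mean()
    return reconstruction_loss + sparsity_loss


# Usage sketch: train on activations collected from the model of interest,
# then inspect which inputs most strongly activate each feature.
sae = SparseAutoencoder()
x = torch.randn(32, 512)  # stand-in batch of activations
recon, feats = sae(x)
loss = loss_fn(x, recon, feats)
loss.backward()
```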
It seems at least somewhat reasonable to ask people to write defensively to guard against their statements being misused by adversarial actors. I recognize this is an annoying ask that may carry significant overhead; perhaps it will turn out not to be worth the cost.
+1, I think the correct conclusion is “a16z are telling bald-faced lies to major governments,” not “a16z were misled by Anthropic hype.”