I think we’d agree that existing mech interp stuff has not done particularly impressive safety-relevant things. But I think the argument goes both ways, and pessimism for (mech) interp should also apply for its ability to do capabilities-relevant things as well.
I’m particularly worried about MI people studying instances of when LLMs do and don’t express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
To use your argument, what does MI actually do here? It seems that you could just study the LLMs directly with either behavioral evals or non-mechanistic, top-down interpretability methods, and probably get results more easily. Or is it a generic, don’t study/publish papers on situational awareness?
On the other hand, interpretability research is probably crucial for AI alignment.
As you say in your linked post, I think it’s important to distinguish between mechanistic interp and broadly construed model-internals (“interpretability”) research. That being said, my guess is that we’d agree that the broadly construed version of interpretability (“using non-input-output modalities of model interaction to better predict or describe the behavior of models”) is clearly important, and also that mechanistic interp has not made a super strong case for its usefulness as of writing.
I think we’d agree that existing mech interp stuff has not done particularly impressive safety-relevant things. But I think the argument goes both ways, and pessimism for (mech) interp should also apply for its ability to do capabilities-relevant things as well.
To use your argument, what does MI actually do here? It seems that you could just study the LLMs directly with either behavioral evals or non-mechanistic, top-down interpretability methods, and probably get results more easily. Or is it a generic, don’t study/publish papers on situational awareness?
As you say in your linked post, I think it’s important to distinguish between mechanistic interp and broadly construed model-internals (“interpretability”) research. That being said, my guess is that we’d agree that the broadly construed version of interpretability (“using non-input-output modalities of model interaction to better predict or describe the behavior of models”) is clearly important, and also that mechanistic interp has not made a super strong case for its usefulness as of writing.
Thanks
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.