Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at.
Right, and that’s a problem. There’s this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It’s the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).
And I think the focus on LLMs is largely to blame for that gap seeming “way too advanced for where we’re at”. I expect it’s much easier to cross if we focus on image models instead.
(And to be clear, even after crossing the internals/environment gap, there will still be a long way to go before we’re ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)