Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at.
Right, and that’s a problem. There’s this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It’s the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).
And I think the focus on LLMs is largely to blame for that gap seeming “way too advanced for where we’re at”. I expect it’s much easier to cross if we focus on image models instead.
(And to be clear, even after crossing the internals/environment gap, there will still be a long way to go before we’re ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)
Good question.
First and most important: if you know beforehand that you’re at risk of entering such a state, then you should (according to your current values) probably put mechanisms in place to pressure your future self to restore your old reward stream. (This is not to say that fully preserving the reward stream is always the right thing to do, but the question of when one shouldn’t preserve one’s reward stream is a separate one, which we can factor apart from the question at hand.)
… and AFAICT, it happens that the human brain already works in a way which would make that happen to some extent by default. In particular, most of our day-to-day planning draws on cached value-estimates, which would still remain, at least for a time, even if the underlying rewards suddenly zeroed out (there’s a toy sketch of this below).
… and it also happens that other humans, like e.g. your friends, would probably prefer (according to their values) for you to have roughly-ordinary reward signals rather than zeros. So that would also push in a similar direction.
And again, you might decide to edit the rewards away from the original baseline afterwards. But that’s a separate question.
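To gesture at what I mean by cached value-estimates carrying behavior for a while: here’s a deliberately toy sketch, just standard tabular Q-learning on a made-up 5-state chain, not a claim about brain mechanisms or anyone’s actual model. The agent learns action-values while a reward sits at one end of the chain, then the reward stream is zeroed out; the greedy policy read off the cached values keeps steering toward the formerly-rewarded state, and only drifts as the values slowly decay under the new all-zero rewards.

```python
# Toy illustration only: tabular Q-learning on a 5-state chain.
# Learn with a reward at state 4, then zero out the reward stream and watch
# the greedy policy (read off the cached Q-values) keep pointing at state 4.
import random

random.seed(0)
N_STATES = 5          # states 0..4; episodes start at state 0; reward lives at state 4
ACTIONS = (-1, +1)    # step left / step right (clipped at the ends of the chain)
ALPHA, GAMMA = 0.1, 0.9

def step(state, action, reward_on):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if (reward_on and nxt == N_STATES - 1) else 0.0
    return nxt, reward

def greedy(Q, s):
    # Pick the highest-value action from the cached table, breaking ties randomly.
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

def run_episodes(Q, n_episodes, reward_on, epsilon=0.2):
    for _ in range(n_episodes):
        s = 0
        for _ in range(30):
            a = random.choice(ACTIONS) if random.random() < epsilon else greedy(Q, s)
            s2, r = step(s, a, reward_on)
            target = r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = s2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

run_episodes(Q, 1000, reward_on=True)        # learn while the reward stream is intact
policy_before = [greedy(Q, s) for s in range(N_STATES)]

run_episodes(Q, 50, reward_on=False)         # reward stream suddenly zeroed out
policy_after = [greedy(Q, s) for s in range(N_STATES)]

print("greedy policy with rewards intact:   ", policy_before)  # points toward state 4
print("greedy policy shortly after zeroing: ", policy_after)   # typically unchanged: cached values persist
```

The analogy is obviously loose; the point is just that behavior keyed off cached value-estimates doesn’t instantly track changes to the underlying reward signal.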
On the other hand, consider a mind which was never human in the first place, never had any values or rewards, and is given the same ability to modify its rewards as in your hypothetical. Then—I claim—that mind has no particular reason to favor any rewards at all. (Although we humans might prefer that it choose some particular rewards!)
Your question touched on several different things, so let me know if that missed the parts you were most interested in.