I broadly agree, but I think there’s more safety research, alongside “Retarget the search,” that focuses on using a trained AI’s own internals to understand things like deception, planning, and preferences, which you didn’t mention. You did say this sort of thing isn’t a central example of “interpretability,” which I agree with, but some more typical sorts of interpretability can be clear instrumental goals for it.
E.g. suppose you want to use an AI’s model of human preferences for some reason. To operationalize this: given a description of a situation, you want to pick which of two described alterations to the situation humans would prefer. This isn’t “really interpretability”; it’s just using a trained model in an unintended way, via hooks into its internals.
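To make that operationalization concrete, here’s a minimal sketch of the kind of hook-based readout I have in mind, using GPT-2 via HuggingFace transformers as a stand-in model. The layer choice, prompt format, and probe are all placeholder assumptions for illustration, not a worked-out method:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 8  # which block to read from; an arbitrary choice for illustration
captured = {}

def hook(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the hidden states
    captured["acts"] = output[0].detach()

handle = model.transformer.h[LAYER].register_forward_hook(hook)

def readout(situation: str, alteration: str, probe: torch.Tensor) -> float:
    """Score one (situation, alteration) pair with a linear probe on the
    captured activations at the final token position."""
    text = f"Situation: {situation}\nChange: {alteration}"
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**ids)
    final_tok = captured["acts"][0, -1]  # shape: (hidden_dim,)
    return float(final_tok @ probe)

# The probe would be fit on a small human-labeled dataset; random here.
probe = torch.randn(model.config.n_embd)

situation = "A city park is being redesigned."
a = "Replace half the lawn with a community garden."
b = "Pave the lawn to add more parking."
preferred = a if readout(situation, a, probe) > readout(situation, b, probe) else b
handle.remove()
```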
But if you’re doing this, there are going to be different possible slices of the model that you could have identified as the “model of human preferences.” They might have different generalization behavior even though they get similar scores on a small human-labeled dataset. And it’s natural to have questions about these different slices, like “how much are they computing facts about human psychology as intermediate steps toward their answers, versus treating the preferences as a non-psychological function of the world?”, questions that it would be useful to answer with interpretability tools if we could.
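To illustrate the “different slices” worry in the same toy setup: you could fit one linear probe per layer on a small labeled comparison set and then score each probe on comparisons drawn from a shifted distribution. Similar in-distribution accuracy with diverging out-of-distribution accuracy is exactly the case where you’d want interpretability tools to tell the slices apart. This continues the sketch above (same model, tokenizer, and torch import), simplifies the pairwise comparison to a per-alteration binary label, and assumes the labeled pair lists are supplied by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def acts_at_all_layers(situation: str, alteration: str):
    """Return the final-token activation at every block, using one forward
    hook per layer (same capture pattern as the previous snippet)."""
    store = []
    handles = [
        blk.register_forward_hook(
            lambda m, i, o: store.append(o[0][0, -1].detach().numpy())
        )
        for blk in model.transformer.h
    ]
    text = f"Situation: {situation}\nChange: {alteration}"
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))
    for h in handles:
        h.remove()
    return store  # list of (hidden_dim,) arrays, one per layer

def fit_and_score(train_pairs, test_pairs):
    """train_pairs / test_pairs: lists of (situation, alteration, human_label),
    where human_label is 1 if humans preferred this alteration in its pair.
    Fit one logistic-regression probe per layer; return (layer, train_acc, test_acc)."""
    train_acts = [acts_at_all_layers(s, a) for s, a, _ in train_pairs]
    test_acts = [acts_at_all_layers(s, a) for s, a, _ in test_pairs]
    y_tr = np.array([lbl for _, _, lbl in train_pairs])
    y_te = np.array([lbl for _, _, lbl in test_pairs])
    results = []
    for layer in range(len(model.transformer.h)):
        X_tr = np.stack([acts[layer] for acts in train_acts])
        X_te = np.stack([acts[layer] for acts in test_acts])
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        results.append((layer, probe.score(X_tr, y_tr), probe.score(X_te, y_te)))
    # Probes with similar train accuracy can still diverge on the shifted test set,
    # which is the "different slices, different generalization" situation in the text.
    return results
```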