If we apply ELK to the AI's output (i.e. extensions of Burns's Discovering Latent Knowledge paper, where you feed the AI's output back into the AI and probe for features the AI itself notices), and somehow ensure that the AI has not learned to produce output that fools its later self plus the ELK probe (this second step seems hard; one approach is to ensure that the features the ELK probe picks up are the same ones actually used by, and causally responsible for, the decision in the first place), then my initial impression is that this would solve deep deception.
Then any plan that looks like deception to the AI could be caught, regardless of which computations produced it (and it seems like ELK can be made cheap enough to run at inference time).
(These are initial reactions that probably have some holes.)
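The check described above could be sketched as a probe over the model's activations when it reads back its own output. This is a toy illustration only: real ELK probes (e.g. CCS from Burns et al.) are unsupervised and run on a real model's hidden states, whereas this uses synthetic activations and a plain supervised logistic-regression probe; the data, dimensions, and `flag_deception` helper are all hypothetical.

```python
# Toy sketch, not a real ELK probe: synthetic "activations" in which
# deceptive plans shift the representation along a fixed direction (the
# feature a probe should find), plus a linear probe trained to detect it
# and applied as a cheap inference-time check.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # assumed activation dimensionality

# Synthetic stand-in for read-back activations.
true_direction = rng.normal(size=D)

def fake_activations(n, deceptive):
    base = rng.normal(size=(n, D))
    return base + (1.5 * true_direction if deceptive else 0)

X = np.vstack([fake_activations(200, False), fake_activations(200, True)])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Train a linear probe by gradient descent on the logistic loss.
w, b = np.zeros(D), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

def flag_deception(acts, threshold=0.5):
    """Cheap inference-time check: probe the read-back activations."""
    return 1 / (1 + np.exp(-(acts @ w + b))) > threshold

print(flag_deception(fake_activations(50, deceptive=True)).mean())
```

The point of the sketch is the shape of the scheme, not the probe itself: the check only works to the extent that the probed feature is the one causally driving the decision, which is exactly the hard second step flagged above.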