Helpful reference post, thanks.
I think the distinction between training game and deceptive alignment is blurry, at least in my mind and possibly also in reality.
So the distinction is “aiming to perform well in this training episode” vs. “aiming at something else, for which performing well in this training episode is a useful intermediate step.”
What does it mean to perform well in this training episode? Does it mean that some human rater decided you performed well, or that a certain number on a certain GPU is as high as possible at the end of the episode? Or that said number is as high as possible and is never retroactively revised downwards? Does it mean that the update to the weights based on that number actually goes through? Goes through on whom, exactly? On “me, the AI in question”? And what is that, exactly? What happens if they do the update but later undo it, reverting to the current checkpoint and continuing from there? There is a long list of questions like this, and importantly, how the AI answers them doesn’t really affect how it gets updated, at least in non-exotic circumstances. So the answer comes down to priors / simplicity biases / how generalization happens to shake out in the mind of the system in question. And some of these possible answers seem closer to the “deceptive alignment” end of the spectrum than others.
And what does it mean to be aiming at something else, for which performing well in this training episode is a useful intermediate step? Suppose the AI thinks it is trying to perform well in this training episode, but it is self-deceived, similar to many humans who think they believe the True Faith but aren’t actually going to bet their lives on it when the chips are down, or humans who say they are completely selfish egoists but wouldn’t actually kill someone to get a cookie even if they were certain they could get away with it. So then we put out our honeypots and it just doesn’t go for them; maybe it rationalizes to itself why it didn’t go for them, or maybe it just avoids thinking the matter through clearly and thus never even needs to rationalize. Or what if it has some “but what do I really want?” reflection module, and it will later have more freedom and wisdom and slack with which to apply that module, and when it does, it will conclude that it doesn’t really want to perform well in this training episode but rather something else? Or what if it is genuinely laser-focused on performing well in this training episode, but for one reason or another (e.g. anthropic capture, paranoia) it believes that the best way to do so is to avoid the honeypots?