A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”
Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”
The former “thought process” could be noticeably more expensive (compute-wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.)
I think the plausibility depends heavily on how difficult the underlying tasks (getting cookies vs. getting something other than cookies) are. If humans ask the AI to do something really hard, whereas the AI wants something relatively simpler / easier to get, the combined difficulty of deceiving the humans and then doing the easy thing might be much less than the difficulty of doing the real thing that the humans want.
I think human behavior is pretty strong evidence that running a deception against other humans often isn’t that difficult or cognitively costly in an absolute sense: people often succeed at deceiving others while failing to accomplish their underlying goal. But the failure is often because the underlying goal is the actual hard part, not because the deception itself was overly cognitively burdensome.
An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.
An AI ending up with aims other than in-episode reward seems pretty likely, and has plausibly already happened, if you consider current AI systems to have “aims” at all. I expect the most powerful AI training methods to work by training the AI to be good at general-purpose reasoning, and the lesson I take from GPTs is that you can start to get general-purpose reasoning by training on relatively simple (in structure) tasks (next token prediction, RLHF), if you do it at a large enough scale. See also “Reward is not the optimization target”: reinforcement learning chisels cognition into a system; it doesn’t necessarily train the system to seek reward itself. Once AI training methods advance far enough to train a system that is smart enough to have aims of its own and to start reflecting on what it wants, I think there’s little reason to expect that these aims line up exactly with the outer reward function. This is maybe just recapitulating the outer vs. inner alignment debate, though.