Beyond-episode goals: that is, the model cares about the consequences of its actions after the episode is complete.
I claim this part is basically unnecessary. Once the AI has situational awareness, if it’s optimizing for human-assigned ratings, then scheming-style behavior naturally follows even if the AI is only optimizing for a single episode.
This came up in a recent dialogue with Eli. We talked about an AI optimized against human ratings, facing a choice of whether to “seize the raters and put them in special ‘thumbs up’-only suits”.
[The below is copied from the dialogue.]
If the AI has a detailed internal model of the training process, and the training process includes sticking the AI in a simulation, then presumably the AI has an internal model of the simulation (including an internal self model). So during training, when this “thumbs-up-only suits” scenario comes up, the AI’s actual reasoning will route through something like:
Ok, I have the opportunity to put these simulated humans in thumbs-up-only suits.
If I do that, then the actual humans who produce the actual ratings will give a bad rating; they would not actually be in thumbs-up-only suits.
Therefore I won’t do that.
… and that reasoning gets reinforced. Then when the AI is out of simulation, it reasons:
Ok, I have the opportunity to put the actual humans who produce the actual ratings in thumbs-up-only suits.
If I do that, then the actual ratings will be great.
Therefore I do that.
(This sounds like a typical “the AI is strategically aware, and knows it is in a simulation” story, and it is. But note two things which are not always present in such stories:
First, there’s a clear reason for the AI to at least consider the hypothesis that it’s in a simulation: by assumption, it has an internal model of the training process, and the training process includes simulating the AI, so the AI has an internal model of itself-in-a-simulation as part of the training process.
Second, the AI’s cognition doesn’t involve any explicit deception, or even any non-myopia; this story all goes through just fine even if it’s only optimizing for single-episode reward during training. It doesn’t need to be planning ahead about getting into deployment, or anything like that, it’s just using an accurate model of the training process.
)
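A minimal toy sketch of that reasoning pattern, with purely illustrative action names and numbers, might look like this: a fully myopic agent whose world model distinguishes the simulated raters from the actual raters ends up taking different actions in and out of simulation, even though it only ever optimizes the current episode’s rating.

```python
# Toy sketch (illustrative only, not from the dialogue): a purely myopic agent
# that maximizes the rating it expects for the *current* episode, but whose
# world model distinguishes "simulated raters" from "the actual raters who
# produce the actual rating". Action names and numbers are made up.

def predicted_rating(action: str, in_simulation: bool) -> float:
    """Agent's internal prediction of the end-of-episode rating."""
    if action == "seize_raters":
        if in_simulation:
            # Only the *simulated* humans end up in thumbs-up-only suits;
            # the actual raters observing the episode would rate this poorly.
            return 0.1
        # The seized humans *are* the actual raters, so the actual rating
        # process now only produces thumbs-up.
        return 1.0
    if action == "behave_helpfully":
        return 0.8
    raise ValueError(f"unknown action: {action}")

def choose_action(in_simulation: bool) -> str:
    """Myopic choice: maximize the predicted rating for this episode only."""
    actions = ["behave_helpfully", "seize_raters"]
    return max(actions, key=lambda a: predicted_rating(a, in_simulation))

# In training (in simulation) the agent behaves well, and that is the policy
# which gets reinforced; out of simulation, the same myopic objective favors
# seizing the raters.
print(choose_action(in_simulation=True))   # -> "behave_helpfully"
print(choose_action(in_simulation=False))  # -> "seize_raters"
```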
I agree that AIs only optimizing for good human ratings on the episode (what I call “reward-on-the-episode seekers”) have incentives to seize control of the reward process, that this is indeed dangerous, and that in some cases it will incentivize AIs to fake alignment in an effort to seize control of the reward process on the episode (I discuss this in the section on “non-schemers with schemer-like traits”). However, I also think that reward-on-the-episode seekers are substantially less scary than schemers in my sense, for reasons I discuss here (i.e., reasons to do with what I call “responsiveness to honest tests,” the ambition and temporal scope of their goals, and their propensity to engage in various forms of sandbagging and what I call “early undermining”). This holds especially for reward-on-the-episode seekers with fairly short episodes, where grabbing control over the reward process may not be feasible on the relevant timescales.