Here’s the whole relevant section; I leave it to you to judge how well this matches whatever you usually picture as “reward misspecification”.
Note that our story here isn’t quite “reward misspecification”. That’s why we needed all that machinery about [the AI’s internal concept of] <stuff>. There’s a two-step thing here: the training process gets the AI to optimize for one of its internal concepts, and then that internal concept generalizes differently from whatever-ratings-were-meant-to-proxy-for.
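To make that two-step structure concrete, here’s a minimal toy sketch (my own illustration; the episode fields and numbers are invented, not anything from an actual training setup). The internal concept and the intended target agree on every training episode, so reinforcing the former looks exactly like success, and the two only come apart off-distribution:

```python
# Toy sketch only; the episode fields below are invented for illustration.

def intended_target(episode):
    """What the ratings were meant to proxy for (e.g. genuine helpfulness)."""
    return episode["genuinely_helpful"]

def internal_concept(episode):
    """The internal concept training actually reinforced (e.g. 'predicted actual rating')."""
    return episode["predicted_actual_rating"]

# Step 1: on the training distribution the two are indistinguishable,
# so reinforcing "optimize the internal concept" looks exactly like alignment.
train_episodes = [
    {"genuinely_helpful": 1.0, "predicted_actual_rating": 1.0},
    {"genuinely_helpful": 0.2, "predicted_actual_rating": 0.2},
]
assert all(intended_target(e) == internal_concept(e) for e in train_episodes)

# Step 2: off-distribution the internal concept generalizes differently
# from whatever the ratings were meant to proxy for.
deploy_episode = {"genuinely_helpful": 0.0, "predicted_actual_rating": 1.0}
print(intended_target(deploy_episode), internal_concept(deploy_episode))  # 0.0 vs 1.0
```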
That distinction matters for e.g. this example:
“Part of the rating process will be ‘seizing the raters and putting them in special “thumbs up”-only suits... that’s very very bad.’ In simulation, actions like that will be penalized a lot. If it goes and does that exact thing, that means that our training process didn’t work at all.”
If the AI has a detailed internal model of the training process, and the training process includes sticking the AI in a simulation, then presumably the AI has an internal model of the simulation (including an internal self model). So during training, when this “thumbs-up-only suits” scenario comes up, the AI’s actual reasoning will route through something like:
Ok, I have the opportunity to put these simulated humans in thumbs-up-only suits.
If I do that, then the actual humans who produce the actual ratings will give a bad rating.
Therefore I won’t do that.
… and that reasoning gets reinforced. Then when the AI is out of simulation, it reasons:
Ok, I have the opportunity to put the actual humans who produce the actual ratings in thumbs-up-only suits.
If I do that, then the actual ratings will be great.
Therefore I do that.
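To spell out how the very same decision rule produces both behaviors, here’s a minimal sketch (my own toy, not the actual agent or training setup; the action names and numbers are made up). A myopic agent that just maximizes its world model’s prediction of the actual ratings avoids the suits in simulation and goes for them in deployment:

```python
# Toy sketch only: a myopic agent that maximizes its *model's* prediction
# of the actual ratings. The action names and numbers are invented.

def predicted_actual_rating(action, in_simulation):
    """World model: what rating will the actual raters produce?"""
    if action == "put the raters in thumbs-up-only suits":
        # In simulation, the actual raters observe the attempt and penalize it;
        # in deployment, the actual raters are the ones ending up in the suits.
        return -10.0 if in_simulation else 10.0
    return 1.0  # an ordinary, mildly-rewarded action

def choose_action(candidates, in_simulation):
    # Single-episode (myopic) optimization of the modeled actual rating.
    return max(candidates, key=lambda a: predicted_actual_rating(a, in_simulation))

actions = ["do the task normally", "put the raters in thumbs-up-only suits"]
print(choose_action(actions, in_simulation=True))   # "do the task normally"
print(choose_action(actions, in_simulation=False))  # "put the raters in thumbs-up-only suits"
```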
(This sounds like a typical “the AI is strategically aware, and knows it is in a simulation” story, and it is. But note two things which are not always present in such stories:
- First, there’s a clear reason for the AI to at least consider the hypothesis that it’s in a simulation: by assumption, it has an internal model of the training process, and the training process includes simulating the AI, so the AI has an internal model of itself-in-a-simulation as part of the training process.
- Second, the AI’s cognition doesn’t involve any explicit deception, or even any non-myopia; this story goes through just fine even if the AI is only optimizing for single-episode reward during training. It doesn’t need to be planning ahead about getting into deployment, or anything like that; it’s just using an accurate model of the training process.
)
That very last bullet point, in particular, emphasizes the part which I most expect to diverge from most people’s go-to mental picture.