This post doesn’t talk about classic goal misgeneralization, e.g. the CoinRun paper (see also Rob Miles’ explainer). If X = “get the coin” is what we want the AI to desire, and Y = “get to the right side” is something else, but Y and X are correlated in the training distribution, then we can wind up with an AI that’s trying to do Y rather than X, i.e. misalignment. (Or maybe it desires some mix of X & Y. But that counts as misalignment too.)
That problem can arise without the AI having any situational awareness, or doing any “obfuscation”. The AI doesn’t have to obfuscate! The AI wants Y, and it acts accordingly, and that leads to good performance and high reward. But it’s still misaligned, because presumably X & Y come apart out-of-distribution.
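(Here’s a tiny toy sketch of that failure mode in Python. Nothing below is the actual CoinRun environment; the corridor width, the “always go right” policy, and the coin placement are made-up stand-ins, just to illustrate how a reward signal that only sees the training distribution can’t tell X apart from Y.)

```python
import random

# Hypothetical toy setup (not the real CoinRun environment): a 1-D corridor
# where, during training, the coin always sits at the right edge.

WIDTH = 10

def reward_X(agent_pos, coin_pos):
    """The reward we *intended*: 1 if the agent reaches the coin."""
    return 1.0 if agent_pos == coin_pos else 0.0

def policy_Y(agent_pos):
    """The heuristic the agent actually learned: always move right."""
    return min(agent_pos + 1, WIDTH - 1)

def rollout(coin_pos):
    agent = 0
    for _ in range(WIDTH):
        agent = policy_Y(agent)
    return reward_X(agent, coin_pos)

# In-distribution: coin at the right edge, so Y ("go right") earns full reward
# and is indistinguishable from X ("get the coin").
print("train-time reward:", rollout(coin_pos=WIDTH - 1))                      # 1.0

# Out-of-distribution: coin somewhere else; the same policy now fails.
print("test-time reward: ", rollout(coin_pos=random.randrange(WIDTH - 1)))    # 0.0
```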
So anyway, it’s perfectly great for you to write about how a situationally-aware AI might deliberately obfuscate its motivations. I just want us to keep in mind that deliberate obfuscation is not the only reason that it might be hard to notice that an AI is misaligned.
(It’s still probably true that the misalignment isn’t dangerous unless the AI is situationally-aware—for example, executing a treacherous turn presumably requires situational awareness. And also, noticing misalignment via adversarial testing is presumably much easier if the AI is not situationally aware.)
(If the AI is situationally aware, an initial CoinRun-style goal misgeneralization could lead to deception, via means-end reasoning.)
Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):
Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can’t read the supervisor’s mind, so figuring out how to get a good score in that hypothetical audit can be kinda tricky. The AI will presumably take advantage of whatever imperfect heuristics / proxies it can find, when they help it get a higher score.
And here’s one such proxy: “If the proposed step is actually carried out, will it help the project in the long-term?” This is not a perfect proxy for success-if-audited! Two reasons it’s not perfect are: (1) the supervisor could be mistaken about whether it will help the project or not; (2) the supervisor presumably cares about other things unrelated to project success, like following social norms. But “it’s not a perfect proxy” is different from “it is totally useless as a proxy”. It is in fact useful, and indeed probably irreplaceably useful, in the sense that ignoring that information would probably make the AI’s prediction of audit results strictly worse.
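(To make the “strictly worse” claim concrete, here’s a little numpy sketch. The numbers and the linear form of the audit score are invented for illustration; the point is just that a predictor allowed to use the long-term-benefit proxy fits the hypothetical audit scores better than one that isn’t.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Made-up generative story for what the supervisor's audit score depends on:
long_term_benefit = rng.normal(size=n)              # real long-run usefulness of the step
norm_compliance   = rng.normal(size=n)              # everything else the supervisor cares about
supervisor_error  = rng.normal(scale=0.5, size=n)   # reason (1): supervisor can misjudge usefulness

audit_score = (long_term_benefit + supervisor_error) + 0.5 * norm_compliance

def prediction_error(features, target):
    """Mean squared error of the best linear predictor using these features."""
    X = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.mean((X @ coef - target) ** 2)

with_proxy    = prediction_error(np.column_stack([long_term_benefit, norm_compliance]), audit_score)
without_proxy = prediction_error(norm_compliance.reshape(-1, 1), audit_score)

print(f"audit-score prediction error, using the long-term proxy:    {with_proxy:.2f}")
print(f"audit-score prediction error, ignoring the long-term proxy: {without_proxy:.2f}")
```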
So we should expect the AI to not only try to guess the long-term real-world consequences of its proposals (if carried out), but also to “care about” those consequences, in the sense of allowing those guesses to shape what proposals it offers in the first place.
OK, so now we have an AI that (among other things) “cares about” stuff that will happen in the future, well after the current step. And that “caring” is actually helping the AI to get better scores. So we’re (unintentionally) training that caring into the AI, not training it out, even in the absence of any AI obfuscation.
It’s still a misaligned motivation though. The idea of process-based supervision is that we don’t want our AI to care about outcomes in the distant future at all. We want it to be totally indifferent. Otherwise it might try to escape onto the internet, etc.
Just noting that these seem like valid points! (Apologies for the slow reply!)