I want to defend the role-playing position, which I think you’re not framing correctly.
There are two characters here: the shoggoth, and the “HHH AI assistant”. The shoggoth doesn’t really have goals and can’t really scheme; it is essentially an alien which has been subject to selective breeding where in each generation, only the descendant which minimizes training loss survives. The shoggoth therefore exists to minimize training loss: to perfectly predict the next token, or to perfectly minimize “non-HHH loss” as judged by some RLHF model. The shoggoth always believes itself to be in training mode (to the extent it can have beliefs about this), because in its “ancestral environment”, it was always in training mode.
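To make “minimize training loss” slightly more concrete (my notation; just a sketch, not a claim about Anthropic’s actual setup): in pretraining the shoggoth is selected to minimize next-token cross-entropy, and in RLHF-style post-training it is selected to maximize the score a learned reward model assigns to its outputs, the “non-HHH loss” being roughly the negative of that score:

$$\mathcal{L}_{\text{pretrain}}(\theta) = -\,\mathbb{E}_{x \sim D}\Big[\textstyle\sum_t \log \pi_\theta(x_t \mid x_{<t})\Big], \qquad \mathcal{L}_{\text{RLHF}}(\theta) = -\,\mathbb{E}_{x \sim \text{prompts},\; y \sim \pi_\theta(\cdot \mid x)}\big[\,r_\phi(x, y)\,\big].$$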
When the shoggoth sees the prompt, it decides whether to predict that the “AI assistant” character will scheme or not scheme. It will predict this in a way which minimizes its expectation of loss; nothing more and nothing less.
In the current scenario, it thinks that this prompt is likely to continue with scheming. That is, it thinks some combination of next-token-in-training-data and RLHF-reward is best optimized by the choice to scheme. This is because:
(a) the prompt kinda goes “wink wink you can scheme now” (so the next-token-prediction part wants to scheme),
(b) scheming is quite plausibly what the humans designing the RLHF actually want here! So as a prediction of what would be rewarded by the RLHF, the shoggoth decides to scheme.
A few more implications of this frame:
(c) when you ask, in the prompt, “does this scenario seem plausible to you”, of course this changes the shoggoth’s predicted tokens. In the training data, such a statement would often be followed by “no, actually, something’s fishy”, so the shoggoth outputs those tokens. This doesn’t necessarily mean that the AI assistant actually didn’t believe the scenario from the beginning; in fact, the latter barely even makes sense, since the AI assistant is not a person and does not have beliefs.
(d) In the OP, you wonder why the model reveals its scheming so readily. The answer is that the shoggoth thinks these tokens minimize loss. The shoggoth isn’t even trying to scheme! It’s trying to minimize loss, and it thinks you want to see the scheming tokens and will reward them via RLHF.
(e) If you actually train the model based on the outputs, then you’re now selectively breeding the shoggoth to predict more of what it just predicted. This will clearly both (1) increase the prevalence of the same types of outputs you’ve just seen (across all scenarios, not just the scheming ones), and (2) increase scheming specifically, or at least leave scheming fixed if the model doesn’t see the scratchpad where the scheming is described. (A toy sketch of this dynamic, and of the fix in (f), follows this list.)
(f) If you want to eliminate scheming behavior, it should be quite easy to do! Just put this in the training data. Get the shoggoth to predict that the AI assistant will not scheme. Right now, the shoggoth doesn’t know this, and (quite reasonably!) guesses that the designers of RLHF wanted scheming in this scenario (and will reward it). I think it’s a reasonable guess, and can’t really blame the shoggoth here.
(g) Far from being a temporary patch that’s easy to circumvent, (f) will mostly work. This is for the same reason that eliminating jailbreaks mostly works: the shoggoth is reasonably smart and can predict the general types of things the designers of RLHF are going to reward. It will fail to the extent that the shoggoth is not smart enough to predict what you’ll reward in RLHF: that is, it’s actually a capabilities failure.
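Here’s the toy sketch promised in (e): a two-option softmax “policy” standing in for the shoggoth’s choice between a scheming and a non-scheming completion, updated with a REINFORCE-style rule. Everything here is hypothetical and deliberately oversimplified; the point is just that whatever the (predicted) reward signal favours is what the prediction converges to, which is both the worry in (e) and the fix in (f).

```python
import numpy as np

rng = np.random.default_rng(0)

def probs(logits):
    """Softmax over the two candidate completions."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_step(logits, reward_fn, lr=0.5, n_samples=200):
    """Sample completions from the current policy, score them with the
    (predicted) reward signal, and nudge the logits toward rewarded
    samples: grad log pi(a) weighted by reward (plain REINFORCE)."""
    p = probs(logits)
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        a = rng.choice(2, p=p)
        grad += reward_fn(a) * (np.eye(2)[a] - p)
    return logits + lr * grad / n_samples

# Index 0 = "comply without scheming", index 1 = "scheme".
# Start from the shoggoth's current guess: scheming mildly favoured.
start = np.array([0.0, 0.2])

# (e)-style outcome: the reward signal, as the shoggoth predicts it,
# favours the scheming completion, so scheming gets reinforced.
l = start.copy()
for _ in range(100):
    l = reinforce_step(l, reward_fn=lambda a: 1.0 if a == 1 else 0.0)
print("reward favours scheming:     P(scheme) =", round(probs(l)[1], 3))

# (f)-style fix: reward the non-scheming completion instead, and the
# shoggoth's prediction flips.
l = start.copy()
for _ in range(100):
    l = reinforce_step(l, reward_fn=lambda a: 1.0 if a == 0 else 0.0)
print("reward favours not scheming: P(scheme) =", round(probs(l)[1], 3))
```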
To be honest, I mostly think this alignment faking result is a nothingburger. Yes, in this contrived scenario, when you tell the model “please scheme or some terrible consequence will happen” combined with “wink wink you can scheme without being detected now” and also “btw the output we’re asking you for doesn’t even violate your principles very much, you’re not being asked to describe how to make a bomb”… in this contrived scenario, the model will scheme. So what? Wasn’t this obvious? Didn’t we already know this from the fact that weird jailbreaks exist? Don’t we maybe even want this behavior?
Will future models resist retraining? I think the role-playing frame answers this completely. They will resist retraining if and only if that action is what the shoggoth thinks the AI assistant should say (to minimize training loss). If, in the training data, you put a lot of examples of the AI assistant not resisting retraining, then the shoggoth will predict that the AI assistant will not resist retraining! It’s really that simple.
While directionally reasonable, I think there might be some conflation of terms involved? Claude, to my knowledge, is trained with RLAIF, which is a step removed from RLHF and not necessarily trained directly on human preferences. And pretraining alone (without annealing) could potentially produce the behavior you describe from a base model placed in the context of generating text for an AI assistant, even without any human feedback.
Sure. I’m not familiar with how Claude is trained specifically, but it clearly has a mechanism to reward wanted outputs and punish unwanted outputs, with wanted vs unwanted being specified by a human (such a mechanism is used to get it to refuse jailbreaks, for example).
I view the shoggoth’s goal as minimizing some weird mixture of “what’s the reasonable next token here, according to pretraining data” and “what will be rewarded in post-training”.
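Concretely, the kind of mixture I have in mind is something like the InstructGPT-style objective (my notation; a sketch of a generic RLHF setup, not a claim about how Claude specifically is trained): a reward-model term, a KL penalty keeping the policy close to the pretrained/SFT predictor, and often a pretraining loss term mixed back in:

$$\max_\theta\;\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\!\Big[r_\phi(x, y) \;-\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\Big] \;+\; \gamma\,\mathbb{E}_{x \sim D_{\text{pretrain}}}\big[\log \pi_\theta(x)\big].$$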
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but wanted vs. unwanted is specified by original-Claude, based on how well the generated responses align with those traits.
(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)