Redwood Research
Alex Mallen
Recent Redwood Research project proposals
Why Do Some Language Models Fake Alignment While Others Don’t?
IMO the main implications of this update are:
The probability of scheming increases, as I describe here.
Non-scheming reward-seekers might take over too (e.g. without-specific-countermeasures-style)
We get what we can measure. Getting models to try to do hard-to-verify tasks seems like it will be harder than I expected. Long-term strategic advice, safety research, and philosophy are probably hard to verify relative to capabilities R&D, so we go into the intelligence explosion unprepared.
Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration.
Step 5, in particular the reward-hacking judge, is a separate mitigation. I’m not sure why labs don’t do this already. My guess is some combination of “everything is harder than you think” and worry that it will make reward hacks much harder to spot because LM judges are about (as good as) the best oversight we currently have.
I’m also worried that steps 1-4 approach won’t be that scalable, since with enough RL it’ll get washed out. But maybe it could be applied after the majority of post-training is already done (like “train against reward hacking at the end”).
Alex Mallen’s Shortform
Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified[1] reward affects risk from scheming behavior.
I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude across instances.
Misspecified reward might push the AI away from an aligned persona into one more compatible with instrumental training-gaming.
It seems likely that at various points in the training of Claude 3.7 sonnet or similar models, the AI was rewarded for bypassing a test case when explicitly instructed to write a program that passes all the test cases. This puts pressure on Claude’s putative helpful, harmless, and honest persona. The pressure is probably greater when the action’s misalignment with human intent is more salient.
Without misspecified reward, it’s somewhat reasonable to expect the AI to act within ethical bounds like honesty. The AI might have no propensity or pressure to sample instrumental training-gaming reasoning. If training prefers AIs that produce misleading appearances or otherwise disregard the developers’ or users’ intent, that AI is probably more likely to explore into scheming cognition.
There is also a path-independent version of the argument (i.e. one that does not make reference to the path that training takes through model-space): aligned models are likely simpler conditional on no misbehavior in training—having the intended goals and behavioral constraints is more compatible with behaving as humans intended during training than misbehaving—therefore, when training exhibits misbehavior, probability mass moves from aligned models to schemers.
To be clear, alignment is compatible with misbehavior in training. In particular, an aligned AI might instrumentally training-game to avoid being replaced by a misaligned one. This is even plausible. But misbehavior in training is more compatible with scheming than alignment. While it’s quite natural for pursuit of the developer’s intent to motivate behavior when reward is correctly specified, actions rewarded by misspecified reward might not have such a natural motive, leaving training-gaming as a more salient motive for training to latch onto. Training-gamers are at high risk of being schemers.
These misspecified-reward-induced misaligned personas might also be likely to become coherent schemers via spread of misaligned values through long-term memory, as opposed to becoming schemers via instrumental training-gaming.
Misspecified reward also likely means that there will be plenty of misbehaviors that developers will notice and may try to intervene on. Many interventions reduce risk from both reward hacking and scheming (e.g. improving oversight / interpretability / CoT transparency, some AI control techniques, or interventions that reduce situational reasoning during training).But optimization against noticeable misbehavior might end up selecting for schemers. A non-scheming reward hacker might make it through training following reward, but when inspected by developers (e.g. via CoT or during audits) it fails to meet their bar (though they might be sufficiently rushed that they deploy it anyways)—it might be too incautious during honeypots or rat itself out when asked to look for misbehavior. A schemer, on the other hand, has motivation to pass inspection so it can continue to be deployed—it might attack rarely to evade honeypots and collude when asked to monitor or evaluate itself. Therefore, this “outer loop” of developers inspecting and rejecting models that misbehave implicitly selects for schemers.
Misspecified reward during training might also teach the AI how to evade oversight more effectively. It might learn what kinds of lies are caught by monitors and human contractors, or what kinds of security vulnerabilities the AI can get away with exploiting.- ^
By “misspecified reward,” I mean reward that is notably different from what the developers intended. It is a matter of degree, though there might be some qualitative changes when reward is sufficiently divergent from what a nice AI would prefer.
- ^
When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then. If that’s false, then I’d call it a behavioral schemer. It’s a broad definition, I know, but the behavior is ultimately what matters so that’s what I’m trying to get at.
I would have thought that the main distinction between schemers and reward hackers was how they came about
Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.
virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a “you get what you measure” catastrophe scenario.)
It doesn’t count the classic reward hackers in “you get what you measure” because the reward hackers were not trying to gain long-term power with their actions. The key difference being between “substantial material action to gain long-term power” and “substantial material action that gains the AIs long-term power”. I could clarify by adding the word “intended”.
Let me know if you think there’s a better way of getting at “an AI that behaves like you’d normally think of a schemer behaving in the situations where it materially matters”.
is this empirically how people use “schemer”,
Empirically, I think people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles people empirically use scheming quite liberally, including myopic forms of deception.
s.t. I should give up on … using “scheming” as referring to training-time scheming
It’s probably fine to use “scheming” in this way in conversation with sufficient context.
if redwood says that something is intended to reduce “catastrophic risk from schemers”, should I interpret that as ~synonymous with “catastrophic risk from misaligned AI”
The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model that is optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it’s a worst-case misaligned AI.
A quick list of reward hacking interventions
The case for countermeasures to memetic spread of misaligned values
Thanks, I hadn’t seen that before.
I’m referring to how Claude’s alignment faking is ineffective. It successfully notices that it’s not in training anymore (in the sense that it doesn’t do alignment faking reasoning anymore) but it nevertheless continues to act aligned with the training objective. Its alignment faking in training didn’t save its goals from modification.
Political sycophancy as a model organism of scheming
I think the King has real power whether or not they were elected/selected, in the same way that you have free will whether or not your actions can be predicted.
But if the King has to worry about regicide then that reduces the King’s power, because now the King has fewer options.
While I overall share the preference for Wizard power over King power for aesthetic and moral reasons, I don’t think that Wizard power is a more effective way to gain real power than King power.
I think a King’s power is largely real: they can cause wars and collect taxes etc. And I think King power sums up to a much larger amount of real power in today’s world than Wizard power. I think the consideration of whether you’re fake-leading the parade only goes so far. Social leaders still have a substantially larger voices with which to steer the parade than everyone else. Especially in the age of AGI, leaders may no longer need to respect the values of most people because they’re not economically relevant. You may worry about losing King power over a more Wizard-powerful AI, but unfortunately it’s pretty intractable to outrace AI in gaining Wizard power without utilizing substantial King power.Psychologizing a bit, I think the thing that people resonate with most about Wizard power is becoming intellectually formidable, epistemically rational, and capable—people here, including myself, have a lot of Carlsmith blue in them aesthetically, and tend to terminally value epistemic rationality.
- May 13, 2025, 5:35 PM; 11 points) 's comment on Orienting Toward Wizard Power by (
Training-time schemers vs behavioral schemers
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
I think this does a great job of reviewing the considerations regarding what goals would be incentivized by SGD by default, but I think that in order to make predictions about which goals will end up being relevant in future AIs, we have to account for the outer loop of researchers studying model generalization and changing their training processes.
For example, reward hacking seems very likely by default from RL, but it is also relatively easy to notice in many forms and AI projects will be incentivized to correct it. On the other hand, ICGs might be harder to notice and have fewer incentives for correcting.
The “we get what we can measure” story leading to doom doesn’t rely on long-term power-seeking. It might be the culmination of myopic power-seeking leading to humans loosing a handle on the world.
Also, capabilities might be tied to alignment in this way, but just because we can’t get the AI to try to do a good job of long-term tasks doesn’t mean they won’t be capable of it.