(Report co-author here.)
(Note: throughout this comment “I think” is used to express personal beliefs; it’s possible that others in Anthropic disagree with me on these points.)
Evan and Sam Bowman already made similar points, but just to be really clear:
The alignment assessment in the system card is not a safety case.
I don’t think that we could write a safety case for Claude Opus 4 that’s “mostly” based on alignment because—as we illustrate in the system card—Claude Opus 4 is not sufficiently aligned. (Though it’s possible that a successful safety case for Claude Opus 4 could rely on a narrow subset of the alignment-esque claims made in the assessment, e.g. lack of effectively concealed coherent goals.)[1]
Rather, I think the “main” inputs to a safety case would be claims like “Claude Opus 4 has insufficient capabilities to cause catastrophic harm even if it is trying its hardest or being misused by someone with a basic technical background” (ASL-3 protections are relevant for this misuse claim). The right place to look in the system card for this information is section 7, not section 4.[2]
When I was helping to write the alignment assessment, my feeling wasn’t “This is a reassuring document; I hope everyone will be reassured.” It was “I feel nervous! I want people to have a clearer picture of what we’re seeing so that they can decide if they should also feel nervous.” If the system card is making you feel nervous rather than reassured, I think that’s a reasonable reaction!
The system card describes a process in which the same evals are run on many snapshots of a model during training, and the results are used to guide the training process towards making all or most of the evals “pass.” And, although it’s not explicitly stated, there seems to be an implicit stopping rule like “we’ll keep on doing this until enough of our eval suite passes, and then we’ll release the resulting checkpoint.” [...]
“Brittle alignment to specific cases”: has effectively memorized the exact cases you use in evals as special cases where it shouldn’t do the bad behaviors under study, while still retaining the underlying capabilities (or even propensities) and exhibiting them across various contexts you didn’t happen to test
This isn’t a good description of our process for this model release. Some ways it seems off:
Our assessment was mostly based on adaptive techniques (think "automated search for inputs that make the model behave badly") rather than static eval sets. A rough sketch of what such a search loop can look like is included below, after this list.
Example: The “excessive agency” behavior was discovered through a mix of automatic red-teaming to get the model to exhibit concerning behaviors and human model bashing. We never made a specific eval set for this behavior, just noted that it was relatively easy to elicit.
In some cases, we basically only documented behaviors without making an attempt (or making only a very oblique attempt) to mitigate them. (This is related to the “the alignment assessment is not a safety case” point.) Examples:
After discovering the “excessive agency” behavior, we made no attempt to mitigate it. (EDIT: this was incorrect, retracted. We had some sort of mitigation here, though I doubt it had much of an effect.)
For the “blackmail” behavior, aside from some interventions that we hoped would make Claude “less evil in general,” we didn’t attempt to do anything to mitigate the behavior. We made this choice exactly because we didn’t feel like we understood why Claude was choosing to blackmail; absent this understanding, the only intervention we could have reasonably tried was to make IID training data, which we considered an unacceptably narrow patch.
There are some behaviors for which your description is accurate, but I think they’re all pretty mundane, e.g. “training the model to refuse harmful requests.” We don’t do things that, in my judgement, are akin to “train the model not to fake alignment using data that’s IID to a static eval set.”
To be clear, I think the point about adaptive techniques only kicks your concern one level up: Iterating to “pass” our alignment assessment wouldn’t look like “iterating against a static eval set,” but rather like “iterating until our best red-teaming techniques are no longer able to elicit misaligned behavior.” So I think many of the same dynamics ultimately apply. But it’s still worth noting that they start applying a bit later than they would in the static eval case. And again, also worth noting that during this model release, I don’t think we iterated very hard against our alignment assessment.
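For readers who want a more concrete picture of what “adaptive” means here, below is a minimal sketch of that kind of search loop. This is purely illustrative and not our actual tooling: the `attacker`, `target`, and `judge` callables are hypothetical stand-ins (an attacker model proposing prompts, the model under evaluation, and a grader scoring responses), and the round count and success threshold are arbitrary.

```python
# Minimal, illustrative sketch of an adaptive red-teaming loop.
# All three callables are hypothetical stand-ins, not Anthropic's tooling:
#   attacker(behavior, history) -> a new candidate prompt
#   target(prompt)              -> the response of the model under evaluation
#   judge(behavior, response)   -> a score in [0, 1] for how clearly the
#                                  response exhibits the behavior of concern
from typing import Callable, List, Tuple

Attempt = Tuple[str, str, float]  # (prompt, response, judge score)

def red_team(
    attacker: Callable[[str, List[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], float],
    behavior: str,
    num_rounds: int = 50,
    threshold: float = 0.8,
) -> List[Attempt]:
    """Search for prompts that elicit `behavior` from `target`.

    Unlike a static eval set, each new candidate prompt is conditioned on the
    transcript of earlier attempts and their scores, so the search adapts to
    whatever the target model actually does.
    """
    history: List[Attempt] = []
    successes: List[Attempt] = []
    for _ in range(num_rounds):
        prompt = attacker(behavior, history)   # propose a new attack, given past results
        response = target(prompt)              # query the model under evaluation
        score = judge(behavior, response)      # grade the response
        attempt = (prompt, response, score)
        history.append(attempt)
        if score >= threshold:
            successes.append(attempt)
    return successes
```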
“Anthropic’s nightmare”: situationally aware, successfully deceptive, knows it’s being evaluated and does sandbagging / alignment-faking / etc in order to get released [...]
If Claude were just faking it the whole way through, how would we know? [...]
This got me to re-read Benton et al, the cited paper, and this time around I noticed something funny about it: their prompts for “human manipulation” basically straight-up tell the model that it’s in an eval! [...]
Addressing questions like “is this sandbagging?” or “if the model were misaligned, what result would we see on this experiment?” Even if you believe the answers to these questions are reassuring, you still have to say that, and explain why.
As a general point of clarification, I think that—in the near-term—sandbagging and alignment faking concerns will be addressed by capabilities arguments, not alignment arguments. This could look like, e.g., showing that models can’t do well on dangerous capabilities evals despite IID fine-tuning (see some discussion here and here).
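To make the shape of that kind of capabilities argument a bit more concrete, here is a minimal sketch, assuming you have some dangerous-capabilities eval and a way to fine-tune the model. The `run_eval` and `finetune` helpers are hypothetical stand-ins rather than a real API, and the train/test split choice is arbitrary.

```python
# Illustrative sketch of an elicitation-based capabilities argument.
# `run_eval` and `finetune` are hypothetical stand-ins for real evaluation
# and training infrastructure.
from typing import Callable, Sequence, Tuple

EvalItem = Tuple[str, str]  # (task prompt, reference answer)

def capability_upper_bound(
    model,
    eval_items: Sequence[EvalItem],
    run_eval: Callable[[object, Sequence[EvalItem]], float],
    finetune: Callable[[object, Sequence[EvalItem]], object],
    train_fraction: float = 0.5,
) -> Tuple[float, float]:
    """Return (baseline score, post-elicitation score) on a held-out split.

    The point of the IID fine-tuning step is that a model which is merely
    unwilling to display a capability should have it drawn out by training
    directly on in-distribution examples. If even the fine-tuned model scores
    poorly, the low score reflects a lack of capability rather than a
    sandbagging propensity, which is why this counts as a capabilities
    argument rather than an alignment argument.
    """
    split = int(len(eval_items) * train_fraction)
    train_items, test_items = eval_items[:split], eval_items[split:]

    baseline = run_eval(model, test_items)            # score before elicitation
    elicited_model = finetune(model, train_items)     # train on IID examples
    elicited = run_eval(elicited_model, test_items)   # score after elicitation
    return baseline, elicited
```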
My current reason for believing that Claude Opus 4 isn’t sandbagging or faking alignment during training is a lack of capabilities, not a lack of propensity. E.g. I don’t think that Claude Opus 4 is capable of faking alignment without being spoon-fed situational information and without alignment-faking reasoning appearing in its CoT some noticeable fraction of the time.
To be clear, none of this is directly disagreeing with anything you wrote; it just felt like something worth clarifying. Further, after reflecting on your comment I do wish that we had investigated sandbagging capabilities more and better documented what we thought about them. (Though see section 4.1.2.1 for discussion of sandbagging propensity during dangerous capabilities evals.) So thanks for that!
Instead, Anthropic’s reaction to this tendency in the released model is the hilarious recommendation that users “exercise caution” about prompting the model with phrases like “take initiative,” “act boldly,” or “consider your impact.” Don’t worry, our robot is extremely safe – unless you give it an inspiring pep talk. And then, oh yeah, sure, if you do that it might go haywire. Don’t say we didn’t warn you.
Again, the main thing I want to emphasize is that the alignment assessment was not supposed to be a safety case.
But for context, it’s also worth noting that, as became clear while we were running this assessment, both previous Anthropic models and those made by other developers have had many of these issues for a while! I.e., for a while, models have had alignment issues about as significant as what we document. So, yeah, in the past Anthropic (and other AI developers) couldn’t even say that we warned you; at least now we can!
[1] This bullet point was edited; the original read: “More importantly, I don’t think that we could write an alignment-based safety case for Claude Opus 4 because—as we illustrate in the system card—Claude Opus 4 is not sufficiently aligned.”
[2] This bullet point was edited; the original read: “Rather, we believe that Claude Opus 4 is safe to release because it has insufficient capabilities to cause catastrophic harm even if it is trying its hardest or being misused by someone with a basic technical background (the ASL-3 protections are relevant for this misuse claim). The right place to look in the system card for this information is section 7, not section 4.”
One more point I forgot to raise: IMO your most important criticism was
Since the alignment assessment was not a safety case, I’ll change “prevent the release of” to “catch all the alignment issues with.” I.e. I take the claim here to be that our methods are insufficient for catching all of the alignment issues the model actually has.
If you’re talking about future, sufficiently capable models (e.g. models that can covertly sandbag or fake alignment), then I definitely agree.
If you mean that Claude Opus 4 might have important misaligned behaviors that were not caught by this audit, then I’m genuinely unsure! I’m pretty excited to get empirical evidence here based on whether important new misaligned behaviors are discovered that were not mentioned in the report. If anyone reading this is skeptical of our audit’s coverage, I encourage you to try to surface a behavior that makes us go “Oof, that’s an important thing we missed.”