Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.
You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right call to begin with. There are clearly some safety benefits to having access to frontier models, but the question is: are those benefits worth the cost? Given that this is (imo) by far the most important strategic consideration for Anthropic, I’m hoping for far more elaboration here. Why does Anthropic believe it’s important to work on advancing capabilities at all? Why is it worth the potentially world-ending costs?
This section also doesn’t explain why Anthropic needs to advance the frontier. For instance, it isn’t clear to me that anything from “Chapter 1” requires this—does remaining slightly behind the frontier prohibit Anthropic from e.g. developing automated red-teaming, or control techniques, or designing safety cases, etc.? Why? Indeed, as I understand it, Anthropic’s initial safety strategy was to remain behind other labs. Now Anthropic does push the frontier, but as far as I know no one has explained what safety concerns (if any) motivated this shift.
This is especially concerning because pushing the frontier seems very precarious, in the sense you describe here:
If [evaluations are] significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
… and here:
As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.
But without a clear sense of why advancing the frontier is helpful for safety in the first place, it seems pretty easy to imagine missing this narrow target.
Like, here is a situation I feel worried about. We continue to get low-quality evidence about the danger of these systems (e.g. via red-teaming). This evidence is ambiguous and confusing—if a system can in fact do something scary (such as insert a backdoor into another language model), what are we supposed to infer from that? Some employees might think it suggests danger, but others might think that it, e.g., wouldn’t be able to actually execute such plans, or that it’s just a one-off fluke from a model still too incompetent to pose a real threat, etc. How is Anthropic going to think about this? The downside of being wrong is, as you’ve stated, extreme: a long enough pause could kill the company. And the evidence itself is almost inevitably going to be quite ambiguous, because we don’t understand what’s happening inside the model such that it’s producing these outputs.
But so long as we don’t understand enough about these systems to assess their alignment with confidence, I am worried that Anthropic will keep deciding to scale. Because when the evidence is as indirect and unclear as that which is currently possible to gather, interpreting it is basically just a matter of guesswork. And given the huge incentive to keep scaling, I feel skeptical that Anthropic will end up deciding to interpret anything but unequivocal evidence as suggesting enough danger to stop.
This is concerning because Anthropic seems to anticipate such ambiguity, as suggested by the RSP lacking any clear red lines. Ideally, if Anthropic finds that their model is capable of, e.g., self-replication, then this would cause some action like “pause until safety measures are ready.” But in fact what happens is this:
If sufficient measures are not yet implemented, pause training and analyze the level of risk presented by the model. In particular, conduct a thorough analysis to determine whether the evaluation was overly conservative, or whether the model indeed presents near-next-ASL risks.
In other words, one of the first steps Anthropic plans to take if a dangerous evaluation threshold triggers is to question whether that evaluation was actually meaningful in the first place. I think this sort of wiggle room, which is pervasive throughout the RSP, renders it pretty ineffectual—basically just a formal-sounding description of what they (and other labs) were already doing, which is attempting to crudely eyeball the risk.
And given that the RSP doesn’t bind Anthropic to much of anything, much of the decision making hinges on the quality of its company culture. For instance, here is Nick Joseph’s description:
Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.
[...]
But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you solicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core, and I think that will just continue to be.
Which is to say that Anthropic’s RSP doesn’t appear to me to pass the LeCun test. Not only is the interpretation of the evidence left up to Anthropic’s discretion (including retroactively deciding whether a test actually was a red line), but the quality of the safety tests themselves is also a function of company culture (i.e., of whether researchers are “actually trying their best” to “solicit capabilities well enough”).
I think the LeCun test is a good metric, and I think it’s good to aim for. But when the current RSP is so far from passing it, I’m left wanting to hear more discussion of how you’re expecting it to get there. What do you expect will change in the near future, such that balancing these delicate tradeoffs—too lax vs. too strict, too vague vs. too detailed, etc.—doesn’t result in another scaling policy which also doesn’t constrain Anthropic’s ability to scale roughly at all? What kinds of evidence are you expecting you might encounter, that would actually count as a red line? Once models become quite competent, what sort of evidence will convince you that the model is safe? Aligned? And so on.