I mostly disagree with your criticisms.
On the other hand: before dangerous capability X appears, labs specifically testing for signs-of-X should be able to notice those signs. (I’m pretty optimistic about detecting dangerous capabilities; I’m more worried about how labs will get reason-to-believe their models are still safe after those models have dangerous capabilities.)
There’s a good solution: build safety buffers into your model evals. See https://www-files.anthropic.com/production/files/responsible-scaling-policy-1.0.pdf#page=11. “the RSP is unclear on how to handle that ambiguity” is wrong, I think; Anthropic treats a model as ASL-2 until the evals-with-safety-buffers trigger, and then they implement ASL-3 safety measures.
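(For concreteness, here’s a minimal sketch of the safety-buffer idea as I’m reading it. The interval, trigger score, and function names are made up for illustration; this is not Anthropic’s actual eval suite.)

```python
# Minimal sketch of the safety-buffer idea: re-run capability evals at fixed
# effective-compute intervals, and treat the model as ASL-3 as soon as a
# deliberately conservative threshold trips -- i.e. well before the actual
# level of concern. All numbers and names here are hypothetical.

EVAL_INTERVAL = 4.0    # re-evaluate after every 4x increase in effective compute
TRIGGER_SCORE = 0.2    # conservative eval score chosen to fire early (illustrative)

def should_apply_asl3_measures(eval_score: float) -> bool:
    """Escalate to ASL-3 containment/deployment measures once the buffered
    threshold trips, even though the model may still be some distance from
    the un-buffered capability level of concern."""
    return eval_score >= TRIGGER_SCORE

def next_eval_due(effective_compute_at_last_eval: float) -> float:
    """Effective compute at which the next eval round is due."""
    return effective_compute_at_last_eval * EVAL_INTERVAL
```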
I don’t really get this.
Shrug; maybe. I wish I was aware of a better plan… but I’m not.
What’s the probability attached to that “should”? The higher it is, the less of a concern this is, but I don’t think it’s high enough to write the point off. (Separately, agreed that for danger warnings to be useful, they also have to be good at evaluating the impact of mitigations, unless they’re used to halt work entirely.)
I don’t think safety buffers are a good solution; they’re helpful, but there will still be a transition point between ASL-2 models and ASL-3 models, and I think it’s safer for that transition to happen in a lab already operating under ASL-3 security measures than in one still operating under ASL-2 measures. Realistically, I think we’re going to end up in a situation where, for example, Anthropic researchers put a 10% chance on the next 4x scaling leading to the evals declaring a model ASL-3, and it’s not obvious what decision they will (or should) make in that case. Is 10% low enough to proceed, and what are the costs of being ‘early’?
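(To make the “is 10% low enough?” question concrete, here’s a toy expected-cost comparison. Every number in it is invented, and it ignores a lot, e.g. the option of re-running evals partway through the scaleup; the only point is that the answer turns on how costly crossing the line ‘early’ in an ASL-2 lab is relative to pausing or hardening before it was strictly necessary.)

```python
# Toy expected-cost framing of the "10% chance the next 4x scaleup triggers
# ASL-3" decision. Every number below is invented for illustration.

p_asl3 = 0.10                  # estimated chance the next 4x scaleup trips the ASL-3 evals
cost_cross_in_asl2_lab = 100.0 # hypothetical cost of reaching ASL-3 capabilities with only ASL-2 security
cost_pause_early = 5.0         # hypothetical cost of pausing / hardening before it was needed

expected_cost_proceed = p_asl3 * cost_cross_in_asl2_lab   # = 10.0
expected_cost_wait = (1 - p_asl3) * cost_pause_early      # = 4.5

print(expected_cost_proceed, expected_cost_wait)
# With these made-up numbers, hardening first looks better even though ASL-3 is
# "only" 10% likely; the comparison flips if crossing early is cheap or pausing
# is very expensive.
```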
The relevant section of the RSP:

Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems. This means that a model that initially merits ASL-3 containment and deployment measures for national security reasons might later be reduced to ASL-2 if defenses against national security risks (such as biological or cyber defenses) advance, or if dangerous information becomes more widely available. However, to avoid a “race to the bottom”, the latter should not include the effects of other companies’ language models; just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.
I think it’s sensible to reduce models to ASL-2 if defenses against the threat become available (in the same way that it makes sense to demote pathogens from BSL-4 to BSL-3 once treatments become available), but I’m concerned about the “dangerous information becomes more widely available” clause. Suppose you currently can’t get slaughterbot schematics off Google; if those schematics later become available, I am not sure it then becomes OK for models to provide users with slaughterbot schematics. (Specifically, I don’t want companies whose models are ‘safe’ except that they leak dangerous information X to have an incentive to cause dangerous information X to become available through other means.)
[There’s a related, slightly more subtle point here: supposing you currently can get instructions for making a pipe bomb via Google, it can actually reduce security for Claude to explain to users how to make pipe bombs, if Google is recording those searches and supplying information to law enforcement (or the high-ranked sites in Google’s results are honeypots) while Anthropic is doing neither. The baseline is not just “is the information available?” but “who notices you accessing the information?”.]
4. I mean, superior alternatives are always preferred. I am moderately optimistic about “just stop” plans, and am not yet convinced that “scale until our tests tell us to stop” is dramatically superior to “stop now.”
(Like, I think the hope here is to have an AI summer while we develop alignment methods / other ways to make humanity more prepared for advanced AI; it is not clear to me that doing that with the just-below-ASL-3 model is all that much better than doing it with the ASL-2 models we have today.)
Thanks.
Shrug. You can get it ~arbitrarily high by using more sensitive tests. Of course the probability for Anthropic is less than 1, and I totally agree it’s not “high enough to write off this point.” I just feel like this is an engineering problem, not a flawed “core assumption.”
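(As a toy illustration of the “more sensitive tests” claim, and not a description of anyone’s actual eval suite: if each eval task independently has some chance of catching an emerging capability, the probability that everything misses falls fast as you add tasks. The independence assumption and the per-task hit rate are the load-bearing, and questionable, parts.)

```python
# Toy illustration of driving the miss probability down with more tests:
# if each eval task independently catches an emerging dangerous capability
# with some probability, the chance that *all* of them miss shrinks
# geometrically as tasks are added. Numbers are hypothetical.

def miss_probability(per_task_hit_rate: float, num_tasks: int) -> float:
    """Probability every task misses, assuming (unrealistically) independent tasks."""
    return (1 - per_task_hit_rate) ** num_tasks

for n in (1, 5, 20, 50):
    print(n, round(miss_probability(0.15, n), 4))
# 1 -> 0.85, 5 -> ~0.4437, 20 -> ~0.0388, 50 -> ~0.0003 -- so "arbitrarily high"
# detection is achievable on paper, but only if the tasks aren't all blind to
# the same thing.
```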
[Busy now but I hope to reply to the rest later.]