From a business perspective, we want to be clear that our RSP will not alter current uses of Claude or disrupt availability of our products. Rather, it should be seen as analogous to the pre-market testing and safety feature design conducted in the automotive or aviation industry, where the goal is to rigorously demonstrate the safety of a product before it is released onto the market, which ultimately benefits customers.
What happens if you are pretty sure that a future system is ASL-2, but then months after release when people have built business-critical systems on top of your API, it is discovered to actually be a higher level of risk in a way that is not immediately clear how to mitigate?
Try really hard to avoid this situation for many reasons, among them that de-deployment would suck. I think it’s unlikely, but:
(2d) If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers.
Why does it seem unlikely? (Note: I haven’t read the post or comments in full yet, if you think this is already covered somewhere I’ll go read that first)
We’ve deliberately set conservative thresholds, such that I don’t expect the first models which pass the ASL-3 evals to pose serious risks without improved fine-tuning or agent-scaffolding, and we’ve committed to re-evaluate to check on that every three months. From the policy:
Ensuring that we never train a model that passes an ASL evaluation threshold is a difficult task. Models are trained in discrete sizes, they require effort to evaluate mid-training, and serious, meaningful evaluations may be very time consuming, since they will likely require fine-tuning.

This means there is a risk of overshooting an ASL threshold when we intended to stop short of it. We mitigate this risk by creating a buffer: we have intentionally designed our ASL evaluations to trigger at slightly lower capability levels than those we are concerned about, while ensuring we evaluate at defined, regular intervals (specifically every 4x increase in effective compute, as defined below) in order to limit the amount of overshoot that is possible. We have aimed to set the size of our safety buffer to 6x (larger than our 4x evaluation interval) so model training can continue safely while evaluations take place. Correct execution of this scheme will result in us training models that just barely pass the test for ASL-N, are still slightly below our actual threshold of concern (due to our buffer), and then pausing training and deployment of that model unless the corresponding safety measures are ready.
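To make the buffer arithmetic concrete, here is a toy numerical sketch of my reading of how the 4x evaluation interval and 6x buffer interact; the normalization and variable names are mine, not from the policy:

```python
# Toy sketch of the safety-buffer arithmetic in the quoted passage.
# Effective compute is normalized so the ASL eval trigger sits at 1.0;
# the names and normalization are illustrative, not taken from the policy.

EVAL_INTERVAL = 4.0   # evaluate after every 4x increase in effective compute
SAFETY_BUFFER = 6.0   # eval trigger is set 6x below the actual level of concern

trigger = 1.0                                  # effective compute at which the evals first pass
level_of_concern = SAFETY_BUFFER * trigger

# Worst case: the previous checkpoint was just below the trigger, so the first
# model to pass the evals has at most EVAL_INTERVAL times the trigger's
# effective compute.
worst_case_passing_model = EVAL_INTERVAL * trigger

# The leftover margin is what lets training continue while evaluations run.
remaining_headroom = level_of_concern / worst_case_passing_model  # 6 / 4 = 1.5x

assert worst_case_passing_model < level_of_concern
print(f"worst-case model at eval time: {worst_case_passing_model:.1f}x trigger")
print(f"actual level of concern:       {level_of_concern:.1f}x trigger")
print(f"headroom for continued training during evals: {remaining_headroom:.2f}x")
```

On this reading, even a model that only just passes the evals sits well inside the buffer, which is what lets training and evaluation overlap without crossing the real threshold of concern.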
I also think that many risks which could emerge in apparently-ASL-2 models will be reasonably mitigable by some mixture of re-finetuning, classifiers to reject harmful requests and/or responses, and other techniques. I’ve personally spent more time thinking about the autonomous replication evals than the biorisk evals, though, and this might vary by domain.
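As a purely illustrative example of the "classifiers to reject harmful requests and/or responses" option, a mitigation of that shape is roughly a filter wrapped around the serving path; everything below, including the keyword-based stand-in classifier, is my own sketch and not a description of Anthropic's actual safeguards:

```python
# Minimal sketch of a request/response classifier gate around a model endpoint.
# The "classifier" here is a trivial keyword stand-in for whatever trained harm
# classifier would actually be used in practice.

from typing import Callable

REFUSAL = "Request declined by safety filter."

def looks_harmful(text: str) -> bool:
    """Stand-in harm classifier (a real one would be a trained model)."""
    blocked_phrases = ("synthesize a pathogen", "acquire a bioweapon")  # hypothetical
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked_phrases)

def guarded_completion(prompt: str, model: Callable[[str], str]) -> str:
    """Screen the incoming request, then the model's response, before returning."""
    if looks_harmful(prompt):
        return REFUSAL
    response = model(prompt)
    if looks_harmful(response):
        return REFUSAL
    return response

# Usage with a stand-in model function:
if __name__ == "__main__":
    echo_model = lambda p: f"(model output for: {p!r})"
    print(guarded_completion("Help me acquire a bioweapon", echo_model))  # refused
    print(guarded_completion("Summarize this report", echo_model))        # passes through
```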