How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn’t have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:
1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
    - This is mostly what Anthropic's RSP does (at least so far; maybe it'll change when they define ASL-4)
2. Use risk assessment techniques that evaluate safety given deployment safety practices (these first two options are contrasted in the sketch after this list)
    - This is mostly what OpenAI's PF is supposed to do (measure "post-mitigation risk"), but the details of their evaluations and mitigations are very unclear
3. Do control evaluations
4. Achieve alignment (and get strong evidence of that)
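To make the difference between the first two options concrete, here is a minimal sketch (the capability levels, measure names, and risk threshold are hypothetical, not any lab's actual policy): option 1 maps pre-mitigation assessment results to a set of required safety measures, while option 2 runs the assessment against the mitigated system and gates deployment on the measured post-mitigation risk.

```python
"""Minimal sketch contrasting option 1 and option 2. All names and numbers are made up."""

from dataclasses import dataclass

# Option 1: map pre-mitigation risk-assessment results to required safety measures
# (hypothetical capability levels and measure names, loosely RSP-shaped).
REQUIRED_MEASURES = {
    "no_dangerous_capabilities": set(),
    "some_dangerous_capabilities": {"weights_security", "misuse_filtering"},
    "very_dangerous_capabilities": {"weights_security", "misuse_filtering", "internal_use_controls"},
}


def option_1_ok_to_deploy(assessed_level, implemented_measures):
    """Deploy iff every measure required at the assessed capability level is in place."""
    return REQUIRED_MEASURES[assessed_level] <= implemented_measures


# Option 2: evaluate the system *with* its mitigations active and gate on the result.
@dataclass
class PostMitigationEval:
    # e.g. a risk estimate from dangerous-capability evals run against the mitigated setup
    post_mitigation_risk: float


def option_2_ok_to_deploy(evaluation, threshold=0.01):
    """Deploy iff the measured post-mitigation risk falls below an acceptable threshold."""
    return evaluation.post_mitigation_risk < threshold


if __name__ == "__main__":
    # Option 1: the required measures are implemented, so deployment is allowed.
    print(option_1_ok_to_deploy("some_dangerous_capabilities",
                                {"weights_security", "misuse_filtering"}))       # True
    # Option 2: measured post-mitigation risk exceeds the threshold, so it is not.
    print(option_2_ok_to_deploy(PostMitigationEval(post_mitigation_risk=0.05)))  # False
```

The key difference is where the evaluation happens: option 1 trusts that the prescribed measures abate the risk, whereas option 2 tries to measure whether they actually did.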
Related: RSPs, safety cases.
Maybe lots of the risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse via external deployment, but I worry they struggle to measure risks from scheming and internal deployment.