To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer, such that if the old model is below some conservative threshold, it’s very unlikely that the new model is dangerous.
DeepMind says it uses the safety-buffer plan (but it hasn’t yet operationalized thresholds/buffers as far as I know).
Anthropic’s original RSP used the safety-buffer plan; its new RSP doesn’t really use either plan (kinda safety-buffer but it’s very weak). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn’t going well. The 4o evals were rushed because OpenAI (reasonably) didn’t want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn’t be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn’t seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn’t notice before deployment....
(Yay OpenAI for honestly publishing eval results that don’t look good.)
[1] It’s not explicit. The PF says e.g. ‘Only models with a post-mitigation score of “medium” or below can be deployed.’ But it also mentions forecasting capabilities.