To avoid deploying a dangerous model, you can either (1) test the model itself pre-deployment or (2) test a similar, older model with tests that include a safety buffer, such that if the older model scores below some conservative threshold, it's very unlikely that the new model is dangerous.
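To make the distinction concrete, here's a minimal sketch of plan (2) as a decision rule. The threshold, buffer, and eval function are all hypothetical stand-ins just to illustrate the logic; no lab has published anything this precise.

```python
# Hypothetical sketch of the safety-buffer plan (plan 2).
# All names and numbers here are made up for illustration.

DANGER_THRESHOLD = 0.5  # eval score at which a model would count as dangerous
SAFETY_BUFFER = 0.2     # conservative margin covering capability gains between model versions

def run_dangerous_capability_evals(model) -> float:
    """Stand-in for a lab's dangerous-capability eval suite; returns a score in [0, 1]."""
    raise NotImplementedError  # a real lab would run its actual eval suite here

def new_model_probably_safe(older_model) -> bool:
    """Plan (2): test an older, similar model against a conservative threshold.

    If the older model scores below (threshold - buffer), infer that the newer
    model is very unlikely to be dangerous, without evaluating it directly.
    """
    return run_dangerous_capability_evals(older_model) < DANGER_THRESHOLD - SAFETY_BUFFER
```

Plan (1) would instead call the eval suite on the new model itself, with no buffer, which is why it has to happen between training the final checkpoint and deployment.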
DeepMind says it uses the safety-buffer plan (but it hasn’t yet said it has operationalized thresholds/buffers).
Anthropic’s original RSP used the safety-buffer plan; its new RSP doesn’t really use either plan (it’s sort of a safety-buffer plan, but a very weak one). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn’t going well. The 4o evals were rushed because OpenAI (reasonably) didn’t want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn’t be rushed, but this likely hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn’t seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn’t notice before deployment…
(Yay OpenAI for honestly publishing eval results that don’t look good.)
[1] It’s not explicit. The PF says e.g. ‘Only models with a post-mitigation score of “medium” or below can be deployed.’ But it also mentions forecasting capabilities.
It seems unlikely that OpenAI is truly following the test-the-actual-model plan? They keep e.g. putting new experimental versions onto LMSYS, presumably differing mostly in post-training, and it seems pretty expensive to redo all the DC evals on each new version (and I think it’s pretty reasonable to assume that a bit of further post-training hasn’t made things much more dangerous).