To avoid deploying a dangerous model, you can either (1) test the model itself pre-deployment or (2) test a similar, older model with tests that include a safety buffer, such that if the older model scores below some conservative threshold, it's very unlikely that the new model is dangerous.
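To make the distinction concrete, here's a minimal sketch of plan (2) as a decision rule. The threshold, buffer, and eval function are all hypothetical stand-ins just to illustrate the logic; no lab has published anything this precise.

```python
# Hypothetical sketch of the safety-buffer plan (plan 2).
# All names and numbers here are made up for illustration.

DANGER_THRESHOLD = 0.5  # eval score at which a model would count as dangerous
SAFETY_BUFFER = 0.2     # conservative margin covering capability gains between model versions

def run_dangerous_capability_evals(model) -> float:
    """Stand-in for a lab's dangerous-capability eval suite; returns a score in [0, 1]."""
    raise NotImplementedError  # a real lab would run its actual eval suite here

def new_model_probably_safe(older_model) -> bool:
    """Plan (2): test an older, similar model against a conservative threshold.

    If the older model scores below (threshold - buffer), infer that the newer
    model is very unlikely to be dangerous, without evaluating it directly.
    """
    return run_dangerous_capability_evals(older_model) < DANGER_THRESHOLD - SAFETY_BUFFER
```

Plan (1) would instead call the eval suite on the new model itself, with no buffer, which is why it has to happen between training the final checkpoint and deployment.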
DeepMind says it uses the safety-buffer plan (but it hasn’t yet said it has operationalized thresholds/buffers).
Anthropic’s original RSP used the safety-buffer plan; its new RSP doesn’t really use either plan (it’s sort of a safety-buffer plan, but a very weak one). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn’t going well. The 4o evals were rushed because OpenAI (reasonably) didn’t want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn’t be rushed, but this likely hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn’t seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn’t notice before deployment…
(Yay OpenAI for honestly publishing eval results that don’t look good.)
[1] It’s not explicit. The PF says e.g. ‘Only models with a post-mitigation score of “medium” or below can be deployed.’ But it also mentions forecasting capabilities.
It seems unlikely that OpenAI is truly following the test-the-actual-model plan? They keep e.g. putting new experimental versions onto LMSYS, presumably differing mostly in post-training, and it seems pretty expensive to redo all the DC evals on each new version (and I think it’s pretty reasonable to assume that a bit of further post-training hasn’t made things much more dangerous).