I think you’re confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it’s the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it’s low-upside and so easy to catch, but the mistake is weird.)
Based on the PF, they can deploy a model just below the “high” threshold without mitigations. Based on the tweet and blogpost:
We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.
This just seems clearly inconsistent with the PF (it should say the model crosses out of medium-zone by crossing a “high” threshold).
We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.
This doesn’t make sense: if you cross a “medium” threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.
(Sidenote: the tweet and blogpost incorrectly suggest that the “medium” thresholds matter for anything. Based on the PF, only the “high” and “critical” thresholds matter: there are three ways to treat a model, depending on whether it is below high, between high and critical, or above critical.)
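To make the threshold-versus-zone distinction concrete, here is a minimal sketch of the reading of the PF described above. It is not OpenAI's actual procedure. The names Level, overall_zone, treatment and can_deploy and the category labels are hypothetical, and it assumes that a model's overall zone is the highest zone any single category reaches.

```python
from enum import IntEnum


class Level(IntEnum):
    """Risk zones; a zone starts at the threshold of the same name."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def overall_zone(scores: dict[str, Level]) -> Level:
    # Assumption: the overall zone is the highest zone in any category.
    return max(scores.values())


def treatment(scores: dict[str, Level]) -> str:
    """Return which of the three treatment bands the scores fall into."""
    zone = overall_zone(scores)
    if zone >= Level.CRITICAL:
        return "above critical"
    if zone >= Level.HIGH:
        return "between high and critical"
    # Low-zone and medium-zone are treated identically on this reading,
    # which is why the "medium" threshold does not matter.
    return "below high"


def can_deploy(post_mitigation_scores: dict[str, Level]) -> bool:
    # On this reading, deployment only requires staying below "high".
    return overall_zone(post_mitigation_scores) < Level.HIGH


# Hypothetical category labels, for illustration only.
crossed_medium = {"category_a": Level.MEDIUM, "category_b": Level.LOW}
crossed_high = {"category_a": Level.HIGH, "category_b": Level.MEDIUM}

print(treatment(crossed_medium), can_deploy(crossed_medium))
print(treatment(crossed_high), can_deploy(crossed_high))
```

On this sketch, a model that has crossed a “medium” threshold is in medium-zone and is still deployable, which is why “won’t release if it crosses a medium threshold” reads as inconsistent with the PF.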
I agree that scoring “medium” seems like it would imply crossing into the medium zone, although I think what they actually mean is “at most medium.” The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they’re in the “medium zone” and they can’t deploy. But if they’re all medium, then they’re in the “below medium zone” and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.
Surely if any categories are above the “high” threshold then they’re in “high zone” and if all are below the “high” threshold then they’re in “medium zone.”
And regardless, the reading you describe here seems inconsistent with:
We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.
[edited]
Added later: I think someone else had a similar reading and it turned out they were reading “crosses a medium risk threshold” as “crosses a high risk threshold” and that’s just [not reasonable / too charitable].