Maybe I’m missing the relevant bits, but afaict their preparedness doc says that they won’t deploy a model if it passes the “medium” threshold, e.g.:
Only models with a post-mitigation score of “medium” or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.
The threshold for further development is set at “high,” though: models with a post-mitigation score of “high” or below can still be developed. I.e., they can keep developing so long as models don’t hit the “critical” threshold.
I think you’re confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it’s the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it’s low-upside and so easy to catch, but the mistake is weird.)
Based on the PF, they can deploy a model just below the “high” threshold without mitigations. Based on the tweet and blogpost:
We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.
This just seems clearly inconsistent with the PF (it should say the model crosses out of the medium zone, i.e., crosses a “high” threshold).
We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.
This doesn’t make sense: if you cross a “medium” threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.
(Sidenote: the tweet and blogpost incorrectly suggest that the “medium” thresholds matter for anything; based on the PF, only the “high” and “critical” thresholds do. There are just three regimes for treating models: below “high,” between “high” and “critical,” and above “critical.”)
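To make the threshold/zone logic concrete, here’s a minimal sketch of my reading of the PF. The level names are the PF’s; the enum and function names are my own illustration, not anything OpenAI publishes:

```python
from enum import IntEnum

class Risk(IntEnum):
    # Ordered so that comparisons track the PF's risk ladder.
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def can_deploy(post_mitigation: Risk) -> bool:
    # PF: "only models with a post-mitigation score of 'medium' or below
    # can be deployed", i.e., anything below the "high" threshold is
    # deployable, with or without mitigations.
    return post_mitigation <= Risk.MEDIUM

def can_develop_further(post_mitigation: Risk) -> bool:
    # PF: only models scoring "high" or below can be developed further,
    # i.e., development only stops at "critical".
    return post_mitigation <= Risk.HIGH

# The three regimes; note the "medium" threshold never changes either answer.
assert can_deploy(Risk.LOW) and can_deploy(Risk.MEDIUM)      # medium threshold is inert
assert not can_deploy(Risk.HIGH) and can_develop_further(Risk.HIGH)
assert not can_develop_further(Risk.CRITICAL)
```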
I agree that scoring “medium” seems like it would imply crossing into the medium zone, although I think what they actually mean is “at most medium.” The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might cross a different risk threshold. If any of those are “high,” then they’re in the “medium zone” and they can’t deploy. But if they’re all “medium,” then they’re in the “below-medium zone” and they can. This is my current interpretation, although I agree it’s fairly confusing and they could (and should) be clearer about it.
Surely if any category is above the “high” threshold then the model is in the “high zone,” and if all are below the “high” threshold then it’s (at most) in the “medium zone.”
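To illustrate, continuing the sketch above (the four tracked categories are the PF’s; the scores and the max-over-categories aggregation are just my illustration of this reading):

```python
# Worst category determines the model's zone (my reading of the PF's
# per-category scoring; the scores here are made up for illustration).
category_scores = {
    "cybersecurity": Risk.MEDIUM,
    "CBRN": Risk.HIGH,             # one category over the "high" threshold...
    "persuasion": Risk.LOW,
    "model autonomy": Risk.MEDIUM,
}
overall = max(category_scores.values())
assert overall == Risk.HIGH        # ...puts the whole model in the high zone,
assert not can_deploy(overall)     # so it can't be deployed as-is.
```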
And regardless, the reading you describe here seems inconsistent with:
We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.
[edited]
Added later: I think someone else had a similar reading and it turned out they were reading “crosses a medium risk threshold” as “crosses a high risk threshold” and that’s just [not reasonable / too charitable].