If the post-mitigation model scores High in any category, don’t deploy it until mitigations have been implemented that bring it down to Medium.
If the post-mitigation model scores Critical in any category, stop developing it until mitigations have been implemented that bring it down to High.
Any mention of what the “mitigations” in question would be?
In particular, if this all boils down to “do gradient updates against the evals test suite until the model stops doing the bad thing”, or some other variant of “do shallow problem-covering until the red light stops blinking”, then I’d classify this whole plan as probably net-harmful relative to just not pretending to have a safety plan in the first place.
On the other end of the spectrum, if the “mitigations” were things like “go full mechinterp and do not pass Go until every mechanistic detail of the problematic behavior has been fully understood, and removed via multiple methods each of which is itself fully understood mechanistically, all of which are expected to generalize far beyond the particular eval which raised a warning, and any one of which is expected to be sufficient on its own”, then that would be a clear positive relative to baseline.
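To make the first of those two endpoints concrete, here is a toy sketch (my own illustration; nothing here comes from OpenAI’s doc, and all names and numbers are made up) of the “train against the eval until the red light stops blinking” failure mode: the measured score drops below the threshold only because the eval itself was the optimization target, while the underlying capability is untouched.

```python
# Toy model of "do gradient updates against the evals test suite until the
# model stops doing the bad thing". `capability` stands for the underlying
# dangerous behavior; `eval_visibility` stands for how much of that behavior
# this particular eval suite can actually see.

def risk_score(model: dict) -> float:
    # The eval only measures what it can see.
    return model["capability"] * model["eval_visibility"]

def shallow_mitigation(model: dict) -> dict:
    # Optimizing against the eval: the model learns to stop tripping this
    # eval, not to stop having the capability.
    return {**model, "eval_visibility": model["eval_visibility"] * 0.5}

MEDIUM_THRESHOLD = 0.25  # made-up threshold for a "medium" post-mitigation score

model = {"capability": 1.0, "eval_visibility": 1.0}
while risk_score(model) > MEDIUM_THRESHOLD:
    model = shallow_mitigation(model)

print(risk_score(model))     # 0.25: at "medium" on this toy scale, the red light is off
print(model["capability"])   # 1.0: the underlying capability is unchanged
```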
Any mention of what the “mitigations” in question would be?
Not really. For now, OpenAI mostly mentions restricting deployment (this section is pretty disappointing):
A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.
I predict that if you read the doc carefully, you’d say “probably net-harmful relative to just not pretending to have a safety plan in the first place.”
They mention three types of mitigations:
Asset protection (e.g., restricting access to models to a limited, named set of people; general infosec)
Restricting deployment (only models with a risk score of “medium” or below can be deployed)
Restricting development (models with a risk score of “critical” cannot be developed further until safety techniques have been applied that bring the score down to “high.” Although they kind of get to decide when they think their safety techniques have worked sufficiently well. See the sketch of this gating logic after the list.)
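For concreteness, here is a minimal sketch of the deploy/develop gating described above. The thresholds match the summary (deploy only at “medium” or below, keep developing only below “critical”); the category names, function names, and data layout are my own illustrative choices, not anything from the doc itself.

```python
# Deployment and development gates keyed off post-mitigation risk scores.
# Category names below are illustrative examples, not an exhaustive list.

RISK_LEVELS = ["low", "medium", "high", "critical"]

def _at_most(scores: dict[str, str], ceiling: str) -> bool:
    # True if every tracked category is at or below the given risk level.
    return all(RISK_LEVELS.index(level) <= RISK_LEVELS.index(ceiling)
               for level in scores.values())

def can_deploy(post_mitigation: dict[str, str]) -> bool:
    # Only models scoring "medium" or below in every category can be deployed.
    return _at_most(post_mitigation, "medium")

def can_continue_development(post_mitigation: dict[str, str]) -> bool:
    # Models scoring "critical" in any category cannot be developed further.
    return _at_most(post_mitigation, "high")

scores = {"cybersecurity": "medium", "model_autonomy": "high"}
print(can_deploy(scores))                 # False: one category is above "medium"
print(can_continue_development(scores))   # True: no category is at "critical"
```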
My one-sentence reaction after reading the doc for the first time is something like “it doesn’t really tell us how OpenAI plans to address the misalignment risks that many of us are concerned about, but with that in mind, it’s actually a fairly reasonable document with some fairly concrete commitments.”