Update, five days later: OpenAI published the GPT-4o system card, with most of what I wanted (but kinda light on details on PF evals).
OpenAI Preparedness scorecard
Context:
OpenAI’s Preparedness Framework says OpenAI will maintain a public scorecard showing their current capability level (they call it “risk level”) in each risk category they track, before and after mitigations.
When OpenAI released GPT-4o, it said “GPT-4o does not score above Medium risk in any of these categories” but didn’t break down risk level by category.
(I’ve remarked on this repeatedly. I’ve also remarked that the ambiguity suggests that OpenAI didn’t actually decide whether 4o was Low or Medium in some categories, but this isn’t load-bearing for the “OpenAI is not following its plan” proposition.)
News: a week ago,[1] a “Risk Scorecard” section appeared near the bottom of the 4o page. It says:
Updated May 8, 2024
As part of our Preparedness Framework, we conduct regular evaluations and update scorecards for our models. Only models with a post-mitigation score of “medium” or below are deployed. The overall risk level for a model is determined by the highest risk level in any category. Currently, GPT-4o is assessed at medium risk both before and after mitigation efforts.
This is not what they committed to publish. It’s quite explicit that the scorecard should show risk in each category, not just overall.[2]
(What they promised: a real version of the image below. What we got: the quote above.)
Additionally, they’re supposed to publish their evals and red-teaming.[3] But OpenAI has still said nothing about how it evaluated 4o.
Most frustrating is the failure to acknowledge that they’re not complying with their commitments. If they were transparent, said they’re behind schedule, and explained their current plan, that would probably be fine. Instead they claim to have implemented the PF and to have evaluated 4o correctly, publicly take credit for that, and ignore the issues.
[1] Archive versions of the 4o page:
May 13 (original)
July 24 (similar to original)
July 26 (“scorecard” added)

[2] Two relevant quotes from the PF:
“As a part of our Preparedness Framework, we will maintain a dynamic (i.e., frequently updated) Scorecard that is designed to track our current pre-mitigation model risk across each of the risk categories, as well as the post-mitigation risk.”
“Scorecard, in which we will indicate our current assessments of the level of risk along each tracked risk category”
And there are no suggestions to the contrary.

[3] This is not as explicit in the PF, but they’re supposed to frequently update the scorecard section of the PF, and the scorecard section is supposed to describe their evals. Regardless, this is part of the White House voluntary commitments:
“Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks, such as effects on fairness and bias[:] . . . . publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose) . . . and the results of adversarial testing conducted to evaluate the model’s fitness for deployment [and include the ‘red-teaming and safety procedures’].”
For more on commitments, see https://ailabwatch.org/resources/commitments/.
Coda: yay OpenAI for publishing the GPT-4o system card, including eval results and the scorecard they promised! (Minus the “Unknown Unknowns” row but that row never made sense to me anyway.)