OpenAI reportedly rushed the GPT-4o evals. This article makes it sound like the problem is not having enough time to test the final model. I don’t think that’s necessarily a huge deal — if you tested a recent version of the model and your tests have a large enough safety buffer, it’s OK to not test the final model at all.
But there are several apparent issues with the application of the Preparedness Framework (PF) for GPT-4o (not from the article):
- They didn’t publish their scorecard
  - Despite the PF saying they would
  - They instead said “GPT-4o does not score above Medium risk in any of these categories.” (Maybe they didn’t actually decide whether it’s Low or Medium in some categories!)
- They didn’t publish their evals
  - Despite the PF strongly suggesting they would
  - Despite committing to do so in the White House voluntary commitments
- While rushing testing of the final model would be OK in some circumstances, OpenAI’s PF is supposed to ensure safety by testing the final model before deployment. (This contrasts with Anthropic’s RSP, which is supposed to ensure safety via a “safety buffer” between evaluations and doesn’t require testing the final model.) So OpenAI committed to testing the final model well, and its current safety plan depends on doing that.
[Edit: this may be ambiguous: OpenAI explicitly committed to test every 2x increase in effective training compute, but maybe the PF merely strongly suggests, rather than explicitly requires, testing all final models (“We will evaluate all our frontier models, including at every 2x effective compute increase during training runs”; “Only models with a post-mitigation score of ‘medium’ or below can be deployed”). This is mostly moot in this case, since here they claimed to evaluate GPT-4o, not an earlier version.]
[Edit: another possible issue is the lack of third-party audits, but they didn’t commit to doing audits at any particular frequency]
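To make the “every 2x effective compute increase” cadence quoted above concrete, here’s a minimal sketch of which training checkpoints such a policy would flag for evaluation. This is purely illustrative, not OpenAI’s tooling: the eval_checkpoints helper and the compute numbers are hypothetical, and how “effective compute” is estimated is left open.

```python
def eval_checkpoints(effective_compute, last_evaluated):
    """Return indices of checkpoints that an 'evaluate at every 2x effective
    compute increase' policy would flag, given effective-compute totals at
    successive checkpoints and the effective compute of the last model that
    was evaluated. Illustrative only."""
    flagged = []
    for i, compute in enumerate(effective_compute):
        if compute >= 2 * last_evaluated:  # crossed a doubling since the last eval
            flagged.append(i)
            last_evaluated = compute
    return flagged

# Hypothetical training run whose effective compute grows past several doublings
# (units are arbitrary; only the ratios matter).
print(eval_checkpoints([1.2, 1.8, 2.5, 3.0, 5.5, 8.0, 12.0], last_evaluated=1.0))
# -> [2, 4, 6]: checkpoints 2, 4, and 6 each cross a 2x threshold and would need evals.
```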
I am also frustrated by the current underwhelming state of safety evals being done in general, and in particular for GPT-4o.

I do think it’s worth mentioning that privately sharing eval results with the Federal government wouldn’t be evident to the general public. I hope that OpenAI is privately sharing more details than they are releasing publicly.

The fact that the public can’t know whether this is the case is a problem. A potential solution might be for the government to report its own take on whether a new frontier model is “in compliance with reporting standards” or not. That way, even though the evals were private, the public would know whether the government had received its private reports.
I agree in theory, but testing the final model feels worthwhile, because we want more direct observability and less complex reasoning in safety cases.
Thanks. Is this because of posttraining? Ignoring posttraining, I’d rather that evaluators get the version that’s 90% of the way through training and be unrushed than get the final version and be rushed. Takes?
Two versions with the same posttraining, one of which had only 90% of the pretraining, are indeed very similar, so there’s no need to evaluate both. But it’s likely more like a model with 80% of the pretraining and 70% of the posttraining of the final model, and that last 30% of posttraining might be significant.