I am also frustrated by the current underwhelming state of safety evals being done in general and in particular for GPT-4o.
I do think it’s worth mentioning that privately sharing eval results with the Federal government wouldn’t be evident to the general public. I hope that OpenAI is privately sharing more details than they are releasing publicly.
The fact that the public can’t know whether this is the case is a problem. A potential solution might be for the government to report whether a new frontier model is “in compliance with reporting standards” or not. That way, even though the evals were private, the public would know whether the government had received its private reports.