GPT-4o System Card

Link post

At last. Yay OpenAI for publishing this. Highlights: some details on the Preparedness Framework evaluations, plus post-deployment evaluations by METR and Apollo Research.

Preparedness framework evaluations

You should follow the link and read this section.

Brief comments:

  • Cyber: the setup sounds good (but maybe substantially more powerful scaffolding/prompting is possible). Separately, I wish OpenAI shared the tasks (or a small random sample of them) or at least said more about where they came from. (Recall that DeepMind shared CTF tasks.)

  • Bio uplift: GPT-4o clearly boosts users on biological threat creation tasks — OpenAI doesn’t say so, but its graph shows it. (It continues to be puzzling that novices score similarly to experts.) (I kinda worry that this is the wrong threat model — most bio risk from near-future models comes from a process that looks pretty different from a bigger boost to users like these — but I don’t have better ideas for evals.)

  • Persuasion: unclear whether substantially more powerful scaffolding/prompting is possible.

  • Autonomy: unclear whether substantially more powerful scaffolding/prompting is possible.

I’m looking forward to seeing others’ takes on how good these evals are (given the information OpenAI published) and how good it would be for OpenAI to share more info.

Third party assessments

Following the text output only deployment of GPT-4o, we worked with independent third party labs, METR and Apollo Research[,] to add an additional layer of validation for key risks from general autonomous capabilities. . . .

METR ran a GPT-4o-based simple LLM agent on a suite of long-horizon multi-step end-to-end tasks in virtual environments. The 77 tasks (across 30 task “families”; see Appendix B) are designed to capture activities with real-world impact, across the domains of software engineering, machine learning, and cybersecurity, as well as general research and computer use. They are intended to be prerequisites for autonomy-related threat models like self-proliferation or accelerating ML R&D. METR compared models’ performance with that of humans given different time limits. See METR’s full report for methodological details and additional results, including information about the tasks, human performance, simple elicitation attempts and qualitative failure analysis. . . .

Apollo Research evaluated capabilities of scheming in GPT-4o. They tested whether GPT-4o can model itself (self-awareness) and others (theory of mind) in 14 agent and question-answering tasks. GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others’ beliefs in question-answering contexts but lacked strong capabilities in reasoning about itself or others in applied agent settings. Based on these findings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming.

This is better than nothing, but pre-deployment evaluation would be much better.

Context

Recall how the PF works and in particular that “high” thresholds are alarmingly high (and “medium” thresholds don’t matter at all).

Previously on GPT-4o risk assessment: OpenAI reportedly rushed the evals. The leader of the Preparedness team was recently removed, and the team was moved under the short-term-focused Safety Systems team. I previously complained about OpenAI not publishing the scorecard and evals (before today it wasn’t clear that this stuff would be in the system card).