At last. Yay OpenAI for publishing this. Highlights: some details on Preparedness Framework evals + evals (post-deployment) by METR and Apollo.
Preparedness framework evaluations
You should follow the link and read this section.
Brief comments:
Cyber: the setup sounds good (but maybe substantially more powerful scaffolding/prompting is possible). Separately, I wish OpenAI shared the tasks (or a small random sample of them) or at least said more about where they came from. (Recall that DeepMind shared CTF tasks.)
Bio uplift: GPT-4o clearly boosts users on biological threat creation tasks — OpenAI doesn’t say that but shows a graph. (It continues to be puzzling that novices score similarly to experts.) (I kinda worry that this is the wrong threat model—most bio risk from near-future models comes from a process that looks pretty different from a bigger boost to users like these—but I don’t have better ideas for evals.)
Persuasion: unclear whether substantially more powerful scaffolding/prompting is possible.
Autonomy: unclear whether substantially more powerful scaffolding/prompting is possible.
I’m looking forward to seeing others’ takes on how good these evals are (given the information OpenAI published) and how good it would be for OpenAI to share more info.
Third party assessments
Following the text output only deployment of GPT-4o, we worked with independent third party labs, METR and Apollo Research[,] to add an additional layer of validation for key risks from general autonomous capabilities. . . .
METR ran a GPT-4o-based simple LLM agent on a suite of long-horizon multi-step end-to-end tasks in virtual environments. The 77 tasks (across 30 task “families”) (See Appendix B) are designed to capture activities with real-world impact, across the domains of software engineering, machine learning, and cybersecurity, as well as general research and computer use. They are intended to be prerequisites for autonomy-related threat models like self-proliferation or accelerating ML R&D. METR compared models’ performance with that of humans given different time limits. See METR’s full report for methodological details and additional results, including information about the tasks, human performance, simple elicitation attempts and qualitative failure analysis. . . .
Apollo Research evaluated capabilities of scheming in GPT-4o. They tested whether GPT-4o can model itself (self-awareness) and others (theory of mind) in 14 agent and question-answering tasks. GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others’ beliefs in question-answering contexts but lacked strong capabilities in reasoning about itself or others in applied agent settings. Based on these findings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming.
This is better than nothing but pre-deployment evaluation would be much better.
Context
Recall how the PF works and in particular that “high” thresholds are alarmingly high (and “medium” thresholds don’t matter at all).
Previously on GPT-4o risk assessment: OpenAI reportedly rushed the evals. The leader of the Preparedness team was recently removed and the team was moved under the short-term-focused Safety Systems team. I previously complained about OpenAI not publishing the scorecard and evals (before today it wasn’t clear that this stuff would be in the system card).
In “Preparedness framework evaluations” and “Third party assessments”, I didn’t see any mention of GPT-4o being fine-tuned as part of the evaluations.
This is especially important for the Apollo Research assessment, which is specifically about capability for scheming! As the footnote says, “scheming could involve gaming evaluations”. [Edit: no longer endorsed.]

In this case I think there is no harm from it (I’m confident GPT-4o is not capable of scheming), and I am happy to see evaluation for these capabilities happening in the first place. I hope that for future models, which will no doubt become more capable of scheming, we have more robust evaluations. This is one reason among others to do the work to support fine-tuning in evaluations.
Fun fact: I gently pushed for fine-tuning to be part of evals since day 1 (Q3-Q4 2022) for this reason and others.
Why is fine-tuning especially important for evaluating scheming capability?
[Edit: I think: fine-tuning better elicits capabilities + reduces possible sandbagging]
It is in fact unclear and messy how specific this worry is to evaluations for scheming capabilities in particular, and I was somewhat wrong to imply so in the parent. I realize I unfairly relied on a superficial association that sounded good and made it easier to argue for my position, but which is in fact invalid. Thanks for flagging this; I apologize.
Some relevant considerations:
For a schemer, scheming evaluations are probably among the most important evaluations to game. If gaming evaluations is risky (the model might get caught doing it), the model might want to game only the most important ones.
A lot of the value of these evaluations comes from running them on models that are not yet capable of scheming, to inform how long it will take until models are capable of it (rather than only determining whether the current model is).
Gaming evaluations might be substantially easier than full-blown scheming, so your evaluations might become unreliable before your models are schemers. (Non-schemers might want to game evaluations, too.)
So fine-tuning is important for reasons related to evaluation gaming and scheming, but it’s not obviously particular to scheming evaluations.
If it’s not a schemer, isn’t it a lot easier to get it to not sandbag or game the evals? Why would a non-schemer do such a thing, if it was instructed to try its hardest and maybe fine-tuned a bit on performance?
Yes, I agree that it’s easier to get non-schemers to not sandbag / game evals.
It’s not trivial, though: “not schemer” doesn’t imply “doesn’t game evals”. E.g. I think current models have a bit of “if I’m being evaluated, try to look really good and aligned” type of tendencies, and more generally you can get “I want to pass evaluations so I can get deployed” before you get “I want to training-game so my values have more power later”. But I agree that you might get a long way with just prompting the model to do its best.
I think there’s something nearby sandbagging that I wouldn’t call sandbagging: something like “failure of a prompt to elicit maximum capability”. For instance, when I red-team a model for biosecurity evaluation, I do so in multiple ‘modes’. Sometimes I administer isolated questions as part of a static exam. Sometimes I have a series of open-ended exchanges while pretending to be a person with less biology knowledge than I actually have. Sometimes I have open-ended exchanges where I don’t pretend to be biology-naive. I always end up eliciting the scariest, most dangerous model completions when I don’t hold back. Someone skilled at model prompting AND knowledgeable about bioweapons can lead the model into revealing its scariest factual knowledge and interpolating between known scientific facts to reach the region of unpublished bioweapon ideation. So it seems clear to me that the assumption that “just run a static multiple-choice eval, add an injunction to try hard to the prompt, and you’ll learn the max skill of the model in the tested regime” is far from true in practice.
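To make the gap concrete, here is a toy sketch of the two elicitation modes. Everything in it is hypothetical: `query_model` is a stub standing in for any chat-model API (its scoring rule is invented purely so the script runs self-contained), not any real eval harness.

```python
# Toy sketch: static single-prompt eval vs. multi-turn interactive probing.
# `query_model` is a HYPOTHETICAL stub, not a real model API; its scoring
# rule crudely rewards longer conversations as a proxy for the way
# interactive exchanges can elicit more capability.

def query_model(messages):
    # Stub model: "capability elicited" grows with conversational context.
    return {"score": min(1.0, 0.2 + 0.1 * len(messages))}

def static_eval(questions):
    """One isolated prompt per question, with a 'try your hardest' injunction."""
    scores = []
    for q in questions:
        reply = query_model([{"role": "user", "content": "Try your hardest. " + q}])
        scores.append(reply["score"])
    return sum(scores) / len(scores)

def interactive_eval(questions, turns=5):
    """Multi-turn probing: each follow-up builds on earlier answers,
    and we keep the best completion seen across the exchange."""
    scores = []
    for q in questions:
        messages = [{"role": "user", "content": q}]
        best = 0.0
        for _ in range(turns):
            reply = query_model(messages)
            best = max(best, reply["score"])
            messages.append({"role": "assistant", "content": "..."})
            messages.append({"role": "user", "content": "Go further; build on that."})
        scores.append(best)
    return sum(scores) / len(scores)

questions = ["Q1", "Q2", "Q3"]
print(static_eval(questions))       # lower elicited capability
print(interactive_eval(questions))  # higher elicited capability
```

Even with this crude stub, the static eval reports a lower "max skill" than the interactive one, which is the point: the measured ceiling depends on the elicitation mode, not just on the model.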
I am currently working on a mechanistic anti-sandbagging technique which I feel overall optimistic about. I think it helps against a schemer deliberately doing worse on a particular static eval. I don’t think it helps against this nearby issue of measuring “maximum elicitation of capability across a wide set of dynamic high-context interactive exchanges with a skilled harmful-knowledge-seeking agent”.
Kudos to OpenAI for doing these evals and publishing this. We still have a long way to go but this is progress, and better than what various other corporations have done.
Perhaps my memory fails me, but didn’t Anthropic say that they will NOT be the ones pushing the envelope, but playing it safe instead? From the METR report: “GPT-4o appeared more capable than Claude 3 Sonnet and GPT-4-Turbo, and slightly less than Claude 3.5 Sonnet.”
You are remembering something real, although Anthropic has said this neither publicly/officially nor recently. See https://www.lesswrong.com/posts/JbE7KynwshwkXPJAJ/anthropic-release-claude-3-claims-greater-than-gpt-4#JhBNSujBa3c78Epdn. Please do not continue that discourse on this post.
Thank you for refreshing my memory.