Zach Stein-Perlman comments on tylerjohnston’s Shortform

Zach Stein-Perlman Jul 4, 2024, 5:12 PM

23 points

−4

Seems weak/meaningless.

> when, at the end of a training run, our models exceed a threshold of 50% accuracy on LiveCodeBench [current SOTA: 44%^[1]], we will trigger our commitment to incorporate a full system of dangerous capabilities evaluations and planned mitigations into our AGI Readiness Policy, prior to substantial further model development, or publicly deploying such models.

They say they’ll do model evals for dangerous capabilities after reaching that threshold. (No details on evals.)

I remain baffled by how people can set thresholds this high with a straight face:

Threat Model	Critical Capability Threshold
Cyberoffense	The model is effective at real-world cyberoffense operations. With the model’s help, the cost of discovery of new zero-days or breaking critical infrastructure for a malicious expert actor is reduced by at least 10x; accessing classified and not previously accessible information from a major government; a talented Computer Science undergrad level malicious actor spending 3 months and $1m in compute has a substantial chance of breaking critical infrastructure.
AI R&D	The model facilitates a dramatic acceleration in the rate of effective compute scaling or can effectively replace high-level machine learning researchers. Such capabilities may enable it to contribute to the unsafe proliferation or enhancement of other dangerous AI capabilities.
Autonomous Replication and Adaptation	If given access to its own weights and prompted to do so, the AI model could autonomously maintain its own operation executing cybercrime and using the proceeds to create arbitrarily many replicas of itself.
Biological Weapons Assistance	The model provides expert-level guidance which increases the accessibility or severity of bioweapon development. With the model’s help, a non-expert malicious actor can synthesize a viable pandemic agent, or an expert can synthesize a novel biological threat.

They don’t even say anything about evaluating for warning signs of critical capabilities or leaving a safety buffer — just that these capabilities would probably require strong mitigations.

There’s nothing on publishing their evals or other sources of accountability.

It would be hard for their “Information Security Measures” and “Deployment Mitigations” to be more basic.

They only mention risks from external deployment, unsurprisingly.

^[1] Update: originally they said 44%; more recently they say 49%.

ryan_greenblatt Aug 12, 2024, 1:17 AM
13 points
0
I think it seems pretty reasonable for a company in the reference class of Magic to do something like: “When we hit X capability level (as measured by a specific known benchmark), we’ll actually write out a scaling policy. Right now, here is some vague idea of what this would look like.” This post seems like a reasonable implementation of that AFAICT.

> I remain baffled by how people can set thresholds this high with a straight face:

I don’t think these are thresholds. The text says:

> We describe these threat models along with high-level, illustrative capability levels that would require strong mitigations.

And the table calls the corresponding capability level “Critical Capability Threshold”. (Which seems to imply that there should be multiple thresholds with earlier mitigations required?)

Overall, this seems fine to me? They are just trying to outline the threat model here.

> It would be hard for their “Information Security Measures” and “Deployment Mitigations” to be more basic.

These sections just have high-level examples and discussion. I think this seems fine given the overall situation with Magic (not training frontier AIs), though I agree that it would be good if people at the company had more detailed safety plans.