Zach Stein-Perlman
AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
I use empty brackets similarly to ellipses in this context; they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)
I think they have formal high and low versions only for o3-mini.
Done, thanks.
I already edited out most of the “like”s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn’t exact. You are free to post your own version but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
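For concreteness, a minimal sketch of pass@n with verification (the function names here are illustrative, not from any particular eval harness):

```python
from typing import Callable, Optional

def verified_pass_at_n(
    sample_fn: Callable[[], str],
    verify_fn: Callable[[str], bool],
    n: int,
) -> Optional[str]:
    """Sample up to n candidate solutions; return the first one that verifies.

    The point: the final answer is selected by an objective check (e.g. running
    the candidate code against tests), not by peeking at the ground truth, so
    allowing n attempts isn't cheating.
    """
    for _ in range(n):
        candidate = sample_fn()
        if verify_fn(candidate):
            return candidate
    return None
```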
It was one submission, apparently.
and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning
The FrontierMath answers are numerical-ish (“problems have large numerical answers or complex mathematical objects as solutions”), so you can just check which answer the model wrote most frequently.
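That is, simple majority voting over final answers; a minimal sketch (names are mine, purely illustrative):

```python
from collections import Counter

def most_frequent_answer(answers: list[str]) -> str:
    """Return the answer produced most often across independent runs.

    This only works when answers can be compared by exact match (e.g.
    FrontierMath's numerical-ish answers); it sidesteps judging which
    run's reasoning was most compelling.
    """
    return Counter(answers).most_common(1)[0][0]
```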
The obvious boring guess is best of n. Maybe you’re asserting that using $4,000 implies that they’re doing more than that.
My guess is they do kinda choose: in training, it’s less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this is different in procedures other than the one Rohin outlined.
The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.
Edit: debate is a technique for iterated amplification—but that tag is terrible too, oh no
I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):
Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, −1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)
More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)
The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)
You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
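To make the example procedure in the quote concrete, here is a minimal sketch of one training step (policy, debater, and judge are hypothetical callables I'm assuming; this illustrates the loop Rohin describes, not anyone's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class DebateOutcome:
    winner: str  # output judged better
    loser: str   # output judged worse

def debate_training_step(x, policy, debater, judge):
    """One step: sample two outputs, debate, judge, then produce
    (a) RL rewards for the debaters and
    (b) a preference pair for training on the original task (e.g. with DPO).
    """
    # Sample two candidate outputs for the same input x.
    y1, y2 = policy(x), policy(x)

    # Two debaters argue for their assigned output; typically both are
    # instances of the same underlying weights.
    transcript = [debater(x, argue_for=y1, argue_against=y2),
                  debater(x, argue_for=y2, argue_against=y1)]

    # The judge decides which output was defended more convincingly.
    y1_wins = judge(x, y1, y2, transcript)
    outcome = (DebateOutcome(winner=y1, loser=y2) if y1_wins
               else DebateOutcome(winner=y2, loser=y1))

    # RL reward for the debaters: +1 for the winning side, -1 for the losing side.
    debater_rewards = {"winning_side": +1.0, "losing_side": -1.0}

    # Preference pair for training the task policy on the original task.
    preference_pair = (outcome.winner, outcome.loser)

    return debater_rewards, preference_pair
```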
This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic’s RSP was published. It’s since become clear that nobody knows how to write an RSP that’s predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.
This is the classic paper on model evals for dangerous capabilities.
On a skim, it’s aged well; I still agree with its recommendations and framing of evals. One big exception: it recommends “alignment evaluations” to determine models’ propensity for misalignment, but such evals can’t really provide much evidence against catastrophic misalignment; better to assume AIs are misaligned and use control once dangerous capabilities appear, at least until much better techniques for measuring misalignment exist.
Interesting point, written up really really well. I don’t think this post was practically useful for me but it’s a good post regardless.
This post helped me distinguish capabilities-y information that’s bad to share from capabilities-y information that’s fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques that elicit more powerful capabilities without more spooky black-box cognition are fine/good.)
To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it’s very unlikely that the new model is dangerous.
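For concreteness, plan (2) amounts to something like the following check (the threshold and buffer here are placeholders, not any lab’s actual operationalization):

```python
def old_model_clears_buffer(old_model_score: float,
                            danger_threshold: float,
                            safety_buffer: float) -> bool:
    """Plan (2): run evals on a similar older model and require it to score
    below a conservative threshold (the danger threshold minus a buffer),
    so that it's very unlikely the newer model has crossed the danger
    threshold even though the newer model itself wasn't tested.
    """
    return old_model_score < danger_threshold - safety_buffer
```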
DeepMind says it uses the safety-buffer plan (but it hasn’t yet said it has operationalized thresholds/buffers).
Anthropic’s original RSP used the safety-buffer plan; its new RSP doesn’t really use either plan (kinda safety-buffer but it’s very weak). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn’t going well. The 4o evals were rushed because OpenAI (reasonably) didn’t want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn’t be rushed, but this likely hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn’t seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn’t notice before deployment.
(Yay OpenAI for honestly publishing eval results that don’t look good.)
[1] It’s not explicit. The PF says e.g. ‘Only models with a post-mitigation score of “medium” or below can be deployed.’ But it also mentions forecasting capabilities.
This early control post introduced super important ideas: trusted monitoring plus the general point
if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.
Briefly:
For OpenAI, I claim the cyber, CBRN, and persuasion Critical thresholds are very high (and also the cyber High threshold). I agree the autonomy Critical threshold doesn’t feel so high.
For Anthropic, most of the action is at ASL-4+, and they haven’t even defined the ASL-4 standard yet. (So you can think of the current ASL-4 thresholds as infinitely high. I don’t think “The thresholds are very high” for OpenAI was meant to imply a comparison to Anthropic; it’s hard to compare since ASL-4 doesn’t exist. Sorry for confusion.)
Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: “I don’t like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don’t think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn’t very truth-tracking.” And I disagree with habryka’s claims that the effect of FTX is obvious.)
practically all metrics of the EA community’s health and growth have sharply declined, and the extremely large and negative reputational effects have become clear.
I want more evidence on your claim that FTX had a major effect on EA reputation. Or: why do you believe it?
Edit: relevant thing habryka said that I didn’t quote above:
For the EA surveys, these indicators looked very bleak:
“Results demonstrated that FTX had decreased satisfaction by 0.5-1 points on a 10-point scale within the EA community”
“Among those aware of EA, attitudes remain positive and actually maybe increased post-FTX —though they were lower (d = −1.5, with large uncertainty) among those who were additionally aware of FTX.”
“Most respondents reported continuing to trust EA organizations, though over 30% said they had substantially lost trust in EA public figures or leadership.”
I think he’s just referring to DC evals, and I think this is wrong because I think other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).