Zach Stein-Perlman
AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
All of the founders committed to donate 80% of their equity. I heard it’s set aside in some way but they haven’t donated anything yet. (Source: an Anthropic human.)
This fact wasn’t on the internet, or at least wasn’t easily findable via Google search. Huh. I only find Holden mentioning that 80% of Daniela’s equity is pledged.
I disagree with Ben. I think the usage that Mark is talking about is a reference to Death with Dignity. A central example (written by me) is
it would be undignified if AI takes over because we didn’t really try off-policy probes; maybe they just work really well; someone should figure that out
It’s playful and unserious, but “X would be undignified” roughly means “it would be an unfortunate error if we did X or let X happen,” and it’s used in the context of AI doom and our ability to affect P(doom).
Edit: wait, it’s likely RL; I’m confused.
OpenAI didn’t fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said
we didn’t go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said
Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn’t generate synthetic ARC data to improve their score.
An OpenAI staff member replied
Correct, can confirm “targeting” exclusively means including a (subset of) the public training set.
and further confirmed that “tuned” in the graph is
a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.
Another OpenAI staff member said
also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint
So on ARC-AGI they just included 300 examples (75% of the 400 in the public training set) in training. Performance is surprisingly good.
[heavily edited after first posting]
Welcome!
To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can’t naively translate benchmark scores to real-world capabilities.
I think he’s just referring to DC evals, and I think that’s wrong: other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of relevant facts).
Edit: maybe I don’t know what he’s referring to.
I use empty brackets similarly to ellipses in this context: they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)
I think they only have formal high and low versions for o3-mini.
Edit: never mind, I don’t know.
Done, thanks.
I already edited out most of the “like”s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating that this isn’t exact. You are free to post your own version, but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
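For concreteness, here’s a minimal sketch of verified pass@n; the `generate` and `run_tests` helpers are hypothetical stand-ins, not anyone’s actual pipeline:

```python
# Hypothetical sketch: verified pass@n.
# `generate` samples one candidate solution; `run_tests` is a cheap,
# reliable verifier (e.g. the problem's test suite). Because the
# verifier picks the winner, sampling n times isn't "cheating".
def pass_at_n(generate, run_tests, problem, n=64):
    for _ in range(n):
        candidate = generate(problem)
        if run_tests(problem, candidate):
            return candidate  # first verified solution wins
    return None  # no candidate passed verification
```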
It was one submission, apparently.
and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning
The FrontierMath answers are numerical-ish (“problems have large numerical answers or complex mathematical objects as solutions”), so you can just check which answer the model wrote most frequently.
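That selection rule is just majority voting over sampled answers; a sketch, assuming answers are exact-match comparable:

```python
from collections import Counter

# Sketch of majority voting ("self-consistency") over sampled answers.
# Works when answers are exact-match comparable, like FrontierMath's
# numerical-ish answers; no judge or verifier needed.
# e.g. majority_vote(["42", "42", "17"]) -> "42"
def majority_vote(sampled_answers):
    normalized = [a.strip() for a in sampled_answers]
    return Counter(normalized).most_common(1)[0][0]
```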
The obvious boring guess is best of n. Maybe you’re asserting that using $4,000 implies that they’re doing more than that.
My guess is they do kinda choose: in training, it’s less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this is different in procedures different from the one Rohin outlined.
The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.
Edit: debate is a technique for iterated amplification—but that tag is terrible too, oh no
I asked Rohin Shah what the debate agenda is about, and he said (off the cuff, not necessarily worded well; context omitted; emphasis mine):
Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, −1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)
More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)
The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)
You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
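To make the example procedure above concrete, here’s a rough sketch in code. Everything in it (the `policy`, `debater`, and `judge` interfaces and their method names) is a placeholder for illustration, not an actual implementation:

```python
# One illustrative training step of the debate procedure described above.
# `policy`, `debater`, and `judge` are placeholder model interfaces
# (often the same underlying weights in practice).
def debate_training_step(policy, debater, judge, x):
    # Sample two candidate outputs for the same input.
    y1 = policy.sample(x)
    y2 = policy.sample(x)

    # Two debater instances argue over which output is better
    # (one assigned to y1, the other to y2).
    transcript = debater.debate(x, y1, y2)

    # A judge reads the debate and decides who won.
    winner = judge.pick_winner(x, transcript)  # 1 or 2
    y_winner, y_loser = (y1, y2) if winner == 1 else (y2, y1)

    # Train the debaters with RL: +1 reward for the winner, -1 for the loser.
    debater.rl_update(transcript, winner=winner)

    # Treat (y_winner, y_loser) as a preference pair for the original task,
    # e.g. via a DPO-style update on the policy.
    policy.preference_update(x, chosen=y_winner, rejected=y_loser)
```

In this framing the debate exists to produce a training signal; at inference time you mostly just use the trained policy, with the option (per the last point above) of rerunning debates to audit especially suspicious actions.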
This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic’s RSP was published. It’s since become clear that nobody knows how to write an RSP that’s predictably neither much too aggressive nor much too weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.
This is the classic paper on model evals for dangerous capabilities.
On a skim, it’s aged well; I still agree with its recommendations and framing of evals. One big exception: it recommends “alignment evaluations” to determine models’ propensity for misalignment, but such evals can’t really provide much evidence against catastrophic misalignment; better to assume AIs are misaligned and rely on control once dangerous capabilities appear, at least until much better techniques for measuring misalignment exist.
Interesting point, written up really well. I don’t think this post was practically useful for me, but it’s a good post regardless.
Some other prompts I use when being a [high-effort body double / low-effort metacognitive assistant / rubber duck]:
What are you doing?
What’s your goal?
Or: what’s your goal for the next n minutes?
Or: what should be your goal?
Are you stuck?
Follow-ups if they’re stuck:
what should you do?
can I help?
have you considered asking someone for help?
If I don’t know who could help, this is more like prompting them to figure out who could help; if I know the manager/colleague/friend they should ask, I might use that person’s name.
Maybe you should x
If someone else was in your position, what would you advise them to do?