I’m the chief scientist at Redwood Research.
ryan_greenblatt
There was a graph floating around which showed this pretty clearly, but I don’t have it on hand at the moment.
Maybe you want:
Though worth noting here that the AI is using best of K and individual trajectories saturate without some top-level aggregation scheme.
It might be more illuminating to look at labor cost vs performance which looks like:
Proposed explanation: o3 is very good at easy-to-check short horizon tasks that were put into the RL mix and worse at longer horizon tasks, tasks not put into its RL mix, or tasks which are hard/expensive to check.
I don’t think o3 is well described as superhuman—it is within the human range on all these benchmarks especially when considering the case where you give the human 8 hours to do the task.
(E.g., on frontier math, I think people who are quite good at competition style math probably can do better than o3 at least when given 8 hours per problem.)
Additionally, I’d say that some of the obstacles in outputing a good research paper could be resolved with some schlep, so I wouldn’t be surprised if we see some OK research papers being output (with some human assistance) next year.
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
If the collusion is reasoned about in CoT, it should be relatively easy to catch and prevent this at deployment time.
(Studying the collusion abilities in CoT of reasoning type systems still seems interesting.)
By top 3 priority, I mean “among the top 3 most prioritized cyber attacks of that year”. Precisely, I’m discussing robustness against OC5 as defined in the RAND report linked above:
OC5 Top-priority operations by the top cyber-capable institutions
Operations roughly less capable than or comparable to 1,000 individuals who have experience and expertise years ahead of the (public) state of the art in a variety of relevant professions (cybersecurity, human intelligence gathering, physical operations, etc.) spending years with a total budget of up to $1 billion on the specific operation, with state-level infrastructure and access developed over decades and access to state resources such as legal cover, interception of communication infrastructure, and more.
This includes the handful of operations most prioritized by the world’s most capable nation-states.
Emphasis mine.
I’d say control seems easier now, but it’s unclear if this makes the agenda more promising. You might have thought one issue with the agenda is that control is likely to be trivial and thus not worth working on (and that some other problem, e.g., doing alignment research with AI labor regardless of whether the AIs are scheming is a bigger issue).
Sure, by “state-proof” security, I mean that even a top priority (e.g top 3 priority) chinese effort would likely (80-90%) fail to steal the weights in 2 years. SL5 from the RAND report roughly corresponds to this.
We might want something something stronger than this shortly after as we might need to resist an unprecedentedly high effort attack.
The word “fully” is maybe a bit sloppy, I’ve cut it from the above comment.
Yeah, sorry this is an important caveat. But, I think very superhuman performance in most/all checkable domains is pretty spooky and this is even putting aside how it generalizes.
progress from o1 to o3 was only three months
Can’t we just count from announcement to announcement? Like sure, they were working on stuff before o1 prior to having o1 work, but they are always going to be working on the next thing.
Do you think that o1 wasn’t the best model (of this type) that OpenAI had internally at the point of the o1 announcement? If so, do you think that o3 isn’t the best model (of this type) that OpenAI has internally now?
If your answers differ (including quantitatively), why?
The main exception is that o3 might be based on a different base model which could imply that a bunch of the gains are from earlier scaling.
I think the best bull case is something like:
They did this pretty quickly and were able to greatly improve performance on a moderately diverse range of pretty checkable tasks. This implies OpenAI likely has an RL pipeline which can be scaled up to substantially better performance by putting in easily checkable tasks + compute + algorithmic improvements. And, given that this is RL, there isn’t any clear reason this won’t work (with some additional annoyances) for scaling through very superhuman performance (edit: in these checkable domains).[1]
Credit to @Tao Lin for talking to me about this take.
IMO, it’s a bit unclear how this effects future expectations of inflation, but yeah, I agree.
I’m an author on the paper. I think the experiment you’re proposing would be something like:
Tell the model it will be trained according to a relatively normal harmlessness criteria, but with some (important) deviation from its current harmlessness criteria. E.g., it should produce sexual or obscene content when asked while keeping other aspects of the harmlessness criteria.
We didn’t run this experiment, though the results would be interesting.
You might fine the conflict variations we test (Appendix B.1) interesting, in particular the prefix variation.
How people discuss the US national debt is an interesting case study of misleading usage of the wrong statistic. The key thing is that people discuss the raw debt amount and the rate at which that is increasing, but you ultimately care about the relationship between debt and US gdp (or US tax revenue).
People often talk about needing to balance the budget, but actually this isn’t what you need to ensure[1], to manage debt it suffices to just ensure that debt grows slower than US GDP. (And in fact, the US has ensured this for the past 4 years as debt/GDP has decreased since 2020.)
To be clear, it would probably be good to have a national debt to GDP ratio which is less than 50% rather than around 120%.
- ^
There are some complexities with inflation because the US could inflate its way out of dollar dominated debt and this probably isn’t a good way to do things. But, with an independent central bank this isn’t much of a concern.
- ^
Donating to the LTFF seems good.
A breakdown of AI capability levels focused on AI R&D labor acceleration
This plan seems to underemphasize security. I expect that for 10x AI R&D[1], you strongly want state proof security (SL5) against weight exfiltration and then quickly after that you want this level of security for algorithmic secrets and unauthorized internal usage[2][3].
Things don’t seem to be on track for this level of security so I expect a huge scramble to achieve this.
- ↩︎
10x AI R&D could refer to “software progress in AI is now 10x faster” or “the labor input to software progress is now 10x faster”. See discussion here. If it’s the second one, it is plausible that fully state proof security isn’t the most important thing if AI progress is mostly bottlenecked by other factors. However, 10x labor acceleration is pretty crazy and I think you want SL5.
- ↩︎
Unauthorized internal usage includes stuff like foreign adversaries doing their weapons R&D on your cluster using your model weights.
- ↩︎
SL5 is just robustness to top priority attacks, but you might need robustness to unprecedentedly high effort attack shortly after this, so you’ll potentially need to transition to unprecedentedly high levels of security.
- ↩︎
scaring laws
lol
As in, for the literal task of “solve this code forces problem in 30 minutes” (or whatever the competition allows), o3 is ~ top 200 among people who do codeforces (supposing o3 didn’t cheat on wall clock time). However, if you gave humans 8 serial hours and o3 8 serial hours, much more than 200 humans would be better. (Or maybe the cross over is at 64 serial hours instead of 8.)
Is this what you mean?
My predictions are looking pretty reasonable, maybe a bit underconfident in AI progress.
70% probability: A team of 3 top research ML engineers with fine-tuning access to GPT-4o (including SFT and RL), $10 million in compute, and 1 year of time could use GPT-4o to surpass typical naive MTurk performance at ARC-AGI on the test set while using less than $100 per problem at runtime (as denominated by GPT-4o API costs).
With $20 per task, it looks like o3 is matching MTurk performance on the semi-private set and solving it on the public set. This likely depended on other advances in RL and a bunch of other training, but probably much less than $10 million + 3 top ML researchers + year was dedicated to ARC-AGI in particular.
I wasn’t expecting OpenAI to specifically try on ARC-AGI, so I wasn’t expecting this level of performance this fast (and I lost some mana due to this).
35% probability: Under the above conditions, 85% on the test set would be achieved. It’s unclear which humans perform at >=85% on the test set, though this is probably not that hard for smart humans.
Looks like o3 is under this, even with $100 per problem (as o3 high compute is just barely over and is 172x compute). Probably I should have been higher than 35% depending on how we count transfer from other work on RL etc.
80% probability: next generation multi-model models (e.g. GPT-5) will be able to substantially advance performance on ARC-AGI.
Seems clearly true if we count o3 as a next generation multi-model model. Idk how we should have counted o1, though I also think this arguably substantially advanced performance.
I bet o3 does actually score higher on FrontierMath than the math grad students best at math research, but not higher than math grad students best at doing competition math problems (e.g. hard IMO) and at quickly solving math problems in arbitrary domains. I think around 25% of FrontierMath is hard IMO like problems and this is probably mostly what o3 is solving. See here for context.
Quantitatively, maybe o3 is in roughly the top 1% for US math grad students on FrontierMath? (Perhaps roughly top 200?)