Thanks for sharing, I hadn’t seen those yet! I’ve had too much on my plate since o1-preview came out to really dig into it, in terms of either playing with it or looking for papers on it.
How much does o1-preview update your view? It’s much better at Blocksworld for example.
Quite substantially. Substantially enough that I’ll add a mention of these results to the post. I saw the near-complete failure of LLMs on obfuscated Blocksworld problems as some of the strongest evidence against LLM generality. The update is even larger because one of the papers is from the same team of strong LLM skeptics (Subbarao Kambhampati’s group) that produced the original results (I am restraining myself, with some difficulty, from jumping up and down and pointing at the level of goalpost-moving in the new paper).
There’s one sense in which it’s not an entirely apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem (in that way it’s more like Ryan’s hybrid approach to ARC-AGI). But since the key question here is whether LLMs are capable of general reasoning at all, that doesn’t really change my view; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.
Here’s a first pass on how much this changes my numeric probabilities—I expect these to be at least a bit different in a week as I continue to think about the implications (original text italicized for clarity):
LLMs continue to do better at Blocksworld and ARC as they scale: 75% → 100%; this is now a thing that has happened (note that o1-preview also showed substantially improved results on ARC-AGI).
LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan’s: 10% → 20%; this still seems quite unlikely to me (especially since hybrid approaches have continued to improve on ARC). Most of my additional credence is on something like ‘the full o1 turns out to already be close to the grand prize mark’, and the rest on ‘OpenAI capabilities researchers manage to use the full o1 to find an improvement to current LLM techniques (e.g. a better prompting approach) that closes the remaining gap’.
Scaffolding & tools help a lot, so that the next gen[7] (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark[8]: 60% → 75% -- I’m tempted to put it higher, but it wouldn’t be that surprising if o1-mark-2 didn’t quite get there even with scaffolding/tools, especially since we don’t have clear insight into how much harder the full test set is.
Same but for the gen after that (GPT-6, Claude 5): 75% → 90%? I feel less sure about this one than the others; it sure seems awfully likely that o2 plus scaffolding will be able to do it! But I’m reluctant to go past 90%: progress could level off due to training data requirements, the o1 → o2 jump might not focus on optimizing for general reasoning, etc. It seems very plausible that I’ll bump this higher on reflection.
The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty[9] → 80%. That sure does seem like the world we’re living in. It’s not clear to me that o1 couldn’t already do original AI research with the right scaffolding. Sakana claims to have gotten there with GPT-4o / Sonnet, but their claims seem overblown to me.
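To make the ‘Python + a for loop’ scaffolding concrete: the basic recipe (roughly in the spirit of Ryan’s hybrid approach) is to sample many candidate solver programs from the model, execute each one against the task’s training examples, and keep a candidate that reproduces them. Here’s a minimal sketch, assuming the public ARC JSON task format; `call_llm` is a hypothetical placeholder for whatever model API you’re using, not a real library call:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to the model of your choice
    and return the raw text of a candidate Python program."""
    raise NotImplementedError

def solve_arc_task(task: dict, n_samples: int = 64):
    """Try to solve one ARC task by sampling candidate programs in a loop.

    `task` is assumed to use the public ARC JSON format:
    {"train": [{"input": grid, "output": grid}, ...], "test": [...]}.
    """
    prompt = (
        "Write a Python function transform(grid) implementing the rule "
        "shown by these input/output examples:\n" + json.dumps(task["train"])
    )
    for _ in range(n_samples):                 # the "for loop"
        candidate_src = call_llm(prompt)
        namespace: dict = {}
        try:
            exec(candidate_src, namespace)     # the "Python" tool
            transform = namespace["transform"]
            # Keep a candidate only if it reproduces every training pair.
            if all(transform(ex["input"]) == ex["output"]
                   for ex in task["train"]):
                return [transform(ex["input"]) for ex in task["test"]]
        except Exception:
            continue                           # bad candidate; sample again
    return None                                # nothing passed the checks
```

The point is just that none of this is exotic: the ‘tools’ amount to a sampler plus an execution check, and the number of samples is the knob you pay inference-time compute for.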
Now that I’ve seen these, I’m going to have to think hard about whether my upcoming research projects in this area (including one I’m scheduled to lead a team on in the spring, uh oh) are still the right thing to pursue. I may write at least a brief follow-up post to this one arguing that we should all update on this question.
Thanks again, I really appreciate you drawing my attention to these.
I’ve now expanded this comment to a post—mostly the same content but with more detail.
https://www.lesswrong.com/posts/wN4oWB4xhiiHJF9bS/llms-look-increasingly-like-general-reasoners