About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I’d freak out if/when LLMs solve them. They’re still 1⁄10 and nothing has changed in the past year, and I doubt o3 will do better. (But I’m not making them public.)
Will write a reply to this comment when I can test it.
DeepSeek gets 2⁄10.
I’m pretty shocked by this result, less because of the 2⁄10 number itself than because of which problem it solved. My P(LLMs can scale to AGI) increased significantly, although not to 50%.
o3-mini-high gets 3⁄10; this is essentially the same result as DeepSeek’s (there were two problems where DeepSeek came very close, and this is one of them). I’m still slightly more impressed with DeepSeek despite the score, but it’s very close.
What score would it take for you to update your p(LLMs scale to AGI) above 50%?
Tricky to answer actually.
I can say more about my model now. The way I’d put it (h/t Steven Byrnes) is that there are three interesting classes of capabilities:
A: sequential reasoning of any kind
B: sequential reasoning on topics where steps aren’t easily verifiable
C: the type of thing Steven mentions here, like coming up with new abstractions/concepts to integrate into your vocabulary to better think about something
Among these, B is obviously a subset of A. And while it’s not obvious, I think C is probably best viewed as a subset of B. Regardless, I think all three are required for what I’d call AGI. (This is also how I’d justify the claim that no current LLM is AGI.) Maybe C isn’t strictly required; I could imagine a mind getting superhuman performance without it, but given how LLMs work otherwise, I don’t think that’s happening.
Up until DeepSeek, I would have also said LLMs are terrible at A. (This is probably a hot take, but I genuinely think it’s true despite benchmark performance continuing to go up.) My tasks were designed to test A, with the hypothesis that LLMs will suck at A indefinitely. For a while, it seemed like people weren’t even focusing on A, which is why I didn’t want to talk about it. But this concern is no longer applicable; the new models are clearly focused on improving sequential reasoning. However, o1 was terrible at it (imo), with almost no improvement over GPT-4 proper, so I actually found o1 reassuring.
This has now mostly been falsified with DeepSeek and o3. (I know the numbers don’t really tell the story, since the score just went from 1 to 2, but based on which problems they solved and how they argue, DeepSeek was where I went “oh shit, they can actually do legit sequential reasoning now”.) Now I’m expecting most of the other tasks to fall as well, so I won’t make similar updates if the score goes to 5⁄10 or 8⁄10. The hypothesis “A is an insurmountable obstacle” can only be falsified once.
That said, it still matters how fast they improve. How much it matters depends on whether you think better performance on A is progress toward B/C. I’m still not sure about this; I’m changing my views a lot right now, so idk. If they score 10⁄10 in the next year, my p(LLMs scale to AGI) will definitely go above 50%, and probably if they do it within 3 years as well, but that’s about the only thing I’m sure about.
Any chance you can post (or PM me) the three problems AIs have already beaten?
Can you say what types of problems they are?
You could call them logic puzzles. I do think most smart people on LW would get 10⁄10 without too much trouble, given enough time, although I’ve never tested this.
Assuming the problems are verifiable, or there’s an easy way to check whether a solution works, I expect o3 to get at least 2⁄10, if not 3⁄10, under high-compute settings.
What’s the last model you checked with, o1-pro?
Just regular o1; I have the $20/month subscription, not the $200/month one.
Do you have a link to these?