My probably contrarian take is that I don’t think improvement on a benchmark of math problems is particularly scary or relevant. It’s not nothing—I’d prefer if it didn’t improve at all—but it only makes me slightly more worried.
Can you say more about your reasoning for this?
About two years ago I made a set of 10 problems that IMO measure progress toward AGI and decided I’d freak out if/when LLMs solve them. They’re still 1/10 and nothing has changed in the past year, and I doubt o3 will do better. (But I’m not making them public.)
Will write a reply to this comment when I can test it.
Can you say what types of problems they are?
You could call them logic puzzles. I do think most smart people on LW would get 10/10 without too many problems, if they had enough time, although I’ve never tested this.
Assuming they’re verifiable, or there’s an easy way to check whether a solution works, I expect o3 to get at least 2/10, if not 3/10, correct under high-compute settings.
What’s the last model you did check with, o1-pro?
Just regular o1; I have the $20/month subscription, not the $200/month one.
Do you have a link to these?
What benchmarks (or other capabilities) do you see as more relevant, and how worried were you before?