I’m very confused about current AI capabilities and I’m also very confused why other people aren’t as confused as I am. I’d be grateful if anyone could clear up either of these confusions for me.
How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?
For example:
O3 scores higher on FrontierMath than the top graduate students
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer
If either of these statements is false (they might be—I haven’t been keeping up on AI progress), then please let me know. If the observations are true, what the hell is going on?
If I were trying to forecast AI progress in 2025, I would be spending all my time trying to reconcile these two observations.
Proposed explanation: o3 is very good at easy-to-check, short-horizon tasks that were put into the RL mix, and worse at longer-horizon tasks, tasks not put into its RL mix, or tasks which are hard/expensive to check.
I don’t think o3 is well described as superhuman—it is within the human range on all these benchmarks, especially when considering the case where you give the human 8 hours to do the task.
(E.g., on FrontierMath, I think people who are quite good at competition-style math can probably do better than o3, at least when given 8 hours per problem.)
Additionally, I’d say that some of the obstacles to outputting a good research paper could be resolved with some schlep, so I wouldn’t be surprised if we see some OK research papers being output (with some human assistance) next year.
I am also very confused. The space of problems has a really surprising structure, permitting algorithms that are incredibly adept at some forms of problem-solving, yet utterly inept at others.
We’re only familiar with human minds, in which there’s a tight coupling between performance on different problems (e.g., between performance on chess or sufficiently well-posed math/programming problems and the general ability to navigate the world). Now we’re generating other minds/proto-minds, and we’re discovering that this coupling isn’t fundamental.
(This is an argument for longer timelines, by the way. Current AIs feel on the very cusp of being AGI, but there in fact might be some vast gulf between their algorithms and human-brain algorithms that we just don’t know how to talk about.)
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer
I don’t think that’s strictly true; the peer-review system often approves utter nonsense. But yes, I don’t think any AI system can generate an actually worthwhile research paper.
I think the main takeaways are the following:
Reliability is way more important than people realized. One of the central problems that hasn’t gone away as AI has scaled is that models’ best performance is too unreliable for anything but very-easy-to-verify problems. In mathematics and programming, easy verification prevents unreliability from becoming crippling; elsewhere, it is the key blocker that standard AI scaling has basically never solved.
It’s possible in practice to disentangle certain capabilities from each other. In particular, math and programming capabilities do not automatically imply other capabilities, even if we had somehow figured out how to make the o-series as good as AlphaZero at math and programming. This is good news for AI control.
The term AGI, and a lot of the foundation built on it (like timelines to AGI), will become less and less relevant over time, both because of its varying meanings and because, as AI progresses, capabilities will be developed in a different order than they are in humans. A lot of confusion is on the way, and we’ll need different metrics.
Tweet below:
https://x.com/ObserverSuns/status/1511883906781356033
We should expect AI that automates AI research/the economy to look more like Deep Blue (brute-forcing a problem, having good execution skills) than like AlphaZero-style AIs that use very clean/aesthetically beautiful algorithmic strategies.
Reliability is way more important than people realized
Yes, but whence human reliability? What makes humans so much more reliable than the SotA AIs? What are AIs missing? The gulf in some cases is so vast it’s a quantity-is-a-quality-all-its-own thing.
I have 2 answers to this.
1 is that the structure of jobs is shaped to accommodate human unreliability by making mistakes less fatal.
2 is that while humans themselves aren’t reliable, their algorithms almost certainly are more powerful at error detection and correction, so the big thing AI needs to achieve is the ability to error-correct or become more reliable.
There’s also the fact that humans are better at sample efficiency than most LLMs, but that’s a more debatable proposition.
the structure of jobs is shaped to accommodate human unreliability by making mistakes less fatal
Mm, so there’s a selection effect on the human end, where the only jobs/pursuits that exist are those which humans happen to be able to reliably do, and there’s a discrepancy between the things humans and AIs are reliable at, so we end up observing AIs being more unreliable, even though this isn’t representative of the average difference between the human vs. AI reliability across all possible tasks?
I don’t know that I buy this. Humans seem pretty decent at becoming reliable at ~anything, and I don’t think we’ve observed AIs being more-reliable-than-humans at anything? (Besides trivial and overly abstract tasks such as “next-token prediction”.)
(2) seems more plausible to me.
My claim was more along the lines of: if an unaided human can’t do a job safely or reliably, as was almost certainly the case 150-200 years ago (if not earlier), we make the job safer using tools, such that human error is way less of a big deal. AIs currently haven’t used tools that increase their reliability in this way.
Remember, it took a long time for factories to be made safe, and I’d expect a similar outcome for driving. So while I don’t think 1 is everything, I do think it’s a non-trivial portion of the reliability difference.
More here:
https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe
O3 scores higher on FrontierMath than the top graduate students
I’d guess that’s basically false. In particular, I’d guess that:
o3 probably does outperform mediocre grad students, but not actual top grad students. This guess is based on generalization from GPQA: I personally tried 5 GPQA problems in different fields at a workshop and got 4 of them correct, whereas the benchmark designers claim the rates at which PhD students get them right are much lower than that. I think the resolution is that the benchmark designers tested on very mediocre grad students, and probably the same is true of the FrontierMath benchmark.
the amount of time humans spend on the problem is a big factor—human performance has compounding returns on the scale of hours invested, whereas o3’s performance basically doesn’t have compounding returns in that way. (There was a graph floating around which showed this pretty clearly, but I don’t have it on hand at the moment.) So plausibly o3 outperforms humans who are not given much time, but not humans who spend a full day or two on each problem.
I bet o3 does actually score higher on FrontierMath than the math grad students who are best at math research, but not higher than the math grad students who are best at competition math problems (e.g. hard IMO problems) and at quickly solving math problems in arbitrary domains. I think around 25% of FrontierMath consists of hard-IMO-like problems, and this is probably mostly what o3 is solving. See here for context.
Quantitatively, maybe o3 is in roughly the top 1% for US math grad students on FrontierMath? (Perhaps roughly top 200?)
I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn’t go into a benchmark (even if it were initially intended for a benchmark), it’d go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don’t have a bird’s-eye-view on the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.
Which perhaps suggests a better way to do math evals is to scope out a set of novel math publications made after a given knowledge-cutoff date, and see if the new model can replicate those? (Though this also needs to be done carefully, since tons of publications are also trivial and formulaic.)
Maybe you want: [graph not shown]
Though worth noting here that the AI is using best-of-k, and individual trajectories saturate without some top-level aggregation scheme.
It might be more illuminating to look at labor cost vs. performance, which looks like: [graph not shown]
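To make the best-of-k point concrete, here is a minimal sketch of how aggregating k attempts inflates apparent reliability. The per-attempt success rate p and the independence assumption are hypothetical; real trajectories are correlated, and the aggregation scheme actually used for o3’s reported scores isn’t specified here.

```python
# Minimal sketch: best-of-k aggregation vs. single-trajectory reliability.
# Assumes k independent attempts with a hypothetical per-attempt success
# rate p; real model trajectories are correlated, so treat this only as
# an illustration of why pass@k can look much better than pass@1.

def best_of_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

if __name__ == "__main__":
    for p in (0.05, 0.25):
        for k in (1, 8, 64, 1024):
            print(f"p={p:.2f}  k={k:>4}  pass@k={best_of_k(p, k):.3f}")
```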
I think a lot of this is factual knowledge. There are five publicly available questions from the FrontierMath dataset. Look at the last of these, which is supposed to be the easiest. The solution given is basically “apply the Weil conjectures”. These were long-standing conjectures, a focal point of lots of research in algebraic geometry in the 20th century. I couldn’t have solved the problem this way, since I wouldn’t have recalled the statement. Many grad students would immediately know what to do, and there are many books discussing this, but there are also many mathematicians in other areas who just don’t know this.
In order to apply the Weil conjectures, you have to recognize that they are relevant, know what they say, and do some routine calculation. As I suggested, the Weil conjectures are a very natural subject to have a problem about. If you know anything about the Weil conjectures, you know that they are about counting points of varieties over a finite field, which is straightforwardly what the problem asks. Further, this is the simplest case, that of a curve, which is e.g. what you’d see as an example in an introduction to the subject.
Regarding the calculation, parts of it are easier if you can run some code, but basically at this point you’re following a routine pattern. There are definitely many examples of someone working out what the Weil conjectures say for some curve in the training set.
Further, asking Claude a bit, it looks like 5^18 ± 6·5^9 + 1 are particularly common cases here. So, if you skip some of the calculation and guess, or if you make a mistake, you have a decent chance of getting the right answer by luck. You still need the sign on the middle term, but that’s just one bit of information. I don’t understand this well enough to know if there’s a shortcut here without guessing.
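For concreteness, here is a minimal sketch that just evaluates those two candidate point counts. The expression q + 1 ± 6·5^9 with q = 5^18 is taken from the comment above, and the only unknown is the sign of the middle term (the “one bit of information”); this is plain arithmetic, not a derivation from the Weil conjectures themselves.

```python
# Minimal sketch: evaluate the two candidate counts q + 1 ± 6*5^9 for q = 5^18.
# The form of the expression is taken from the comment above.

q = 5**18           # size of the finite field F_{5^18}
middle = 6 * 5**9   # the "middle term" whose sign is the one unknown bit

for sign in (+1, -1):
    print(f"q + 1 {'+' if sign > 0 else '-'} 6*5^9 = {q + 1 + sign * middle}")

# The middle term is tiny relative to q (about 3e-6 of it), so both guesses
# are numerically close; they differ only in that one sign.
print("middle/q =", middle / q)
```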
Overall, I feel that the benchmark has been misrepresented. If this problem is representative, it seems to test broad factual knowledge of advanced mathematics more than problem-solving ability. Of course, this question is marked as the easiest of the listed ones. Daniel Litt says something like this about some other problems as well, but I don’t really understand how routine he’s saying they are, and I haven’t tried to understand the solutions myself.
Pulling a quote from the tweet replies (https://x.com/littmath/status/1870560016543138191):
Not a genius. The point isn’t that I can do the problems, it’s that I can see how to get the solution instantly, without thinking, at least in these examples. It’s basically a test of “have you read and understood X.” Still immensely impressive that the AI can do it!
I don’t know a good description of what, in general, 2024 AI should be good at and not good at. But two remarks, from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce.
First, reasoning at a vague level about “impressiveness” just doesn’t and shouldn’t be expected to work. Because 2024 AIs don’t do things the way humans do, they’ll generalize differently, so you can’t make inferences from “it can do X” to “it can do Y” like you can with humans:
There is a broken inference. When talking to a human, if the human emits certain sentences about (say) category theory, that strongly implies that they have “intuitive physics” about the underlying mathematical objects. They can recognize the presence of the mathematical structure in new contexts, they can modify the idea of the object by adding or subtracting properties and have some sense of what facts hold of the new object, and so on. This inference (emitting certain sentences implies intuitive physics) doesn’t work for LLMs.
Second, 2024 AIs are specifically trained on short, clear, measurable tasks. Those tasks also overlap with legible stuff—stuff that’s easy for humans to check. In other words, they are, in a sense, specifically trained to trick your sense of how impressive they are—they’re trained on legible stuff, with not much constraint on the less-legible stuff (and in particular, on the stuff that only becomes legible through total failure on more difficult / longer time-horizon stuff).
The broken inference is broken because these systems are optimized for being able to perform all the tasks that don’t take a long time, are clearly scorable, and have lots of data showing performance. There’s a bunch of stuff that’s really important (and is a key indicator of having underlying generators of understanding) but takes a long time, isn’t clearly scorable, and doesn’t have a lot of demonstration data. But that stuff is harder to talk about and isn’t as intuitively salient as the short, clear, demonstrated stuff.