It’s only now that LLMs are reasonably competent in at least some hard problems
I don’t think that’s the limiter here. Reports in the style of “my unpublished PhD thesis was about doing X using Y methodology, I asked an LLM to do that and it one-shot a year of my work! the equations it derived are correct!” have been around for quite a while. I recall it at least in relation to Claude 3, and more recently, o1-preview.
If LLMs are prompted to combine two ideas, they’ve been perfectly capable of “innovating” for ages now, including at fairly high levels of expertise. I’m sure there’s some sort of cross-disciplinary GPQA-like benchmark that they’ve saturated a while ago, so this is even legible.
The trick is picking which ideas to combine/in what direction to dig. This doesn’t appear to be something LLMs are capable of doing well on their own, nor do they seem to speed up human performance on this task. (All cases of them succeeding at it so far have been, by definition, “searching under the streetlight”: checking whether they can appreciate a new idea that a human already found on their own and evaluated as useful.)
I suppose it’s possible that o3 or its successors change that (the previous benchmarks weren’t measuring that, but surely FrontierMath does...). We’ll see.
I expect RL to basically solve the domain
Mm, I think it’s still up in the air whether even the o-series efficiently scales (as in, without requiring a Dyson Swarm’s worth of compute) to beating the Millennium Prize Eval (or some less legendary yet still major problems).
I expect such problems don’t pass the “can this problem be solved by plugging the extant crystallized-intelligence skills of a number of people into each other in a non-contrived[1] way?” test. Does RL training allow the model to sidestep this, letting it generate new crystallized-intelligence skills?
I’m not confident one way or another.
we have another scale-up that’s coming up
I’m bearish on that. I expect GPT-4 to GPT-5 to be palpably less of a jump than GPT-3 to GPT-4, same way GPT-3 to GPT-4 was less of a jump than GPT-2 to GPT-3. I’m sure it’d show lower loss, and saturate some more benchmarks, and perhaps an o-series model based on it clears FrontierMath, and perhaps programmers and mathematicians would be able to use it in an ever-greater number of cases...
But I predict, with low-moderate confidence, that it still won’t kick off a deluge of synthetically derived innovations. It’d have even more breadth and an even better eye for nuance, but somehow, perplexingly, still no ability to use those capabilities autonomously.
“Non-contrived” because technically, any cognitive skill is just a combination of e. g. NAND gates, since those are functionally complete. But obviously that doesn’t mean any such skill is accessible if you’ve learned the NAND gate. Intuitively, a combination of crystallized-intelligence skills is only accessible if the idea of combining them is itself a crystallized-intelligence skill (e. g., in the math case, a known ansatz).
Which perhaps sheds some light on why LLMs can’t innovate even via trivial idea combinations. If a given idea-combination “template” isn’t present in the training data, the LLM can’t reliably conceive of it independently, except by brute-force enumeration...? This doesn’t seem quite right, but maybe it’s in the right direction.
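As a toy, purely illustrative sketch of the footnote’s point (my own hypothetical example, not something from the discussion): the primitive being universal doesn’t make the useful compositions accessible; each composition is its own piece of crystallized knowledge.

```python
# NAND is functionally complete: every Boolean function is *some*
# composition of NANDs. But knowing NAND alone doesn't hand you the
# compositions; each one below is its own non-obvious "template".

def nand(a: bool, b: bool) -> bool:
    return not (a and b)

def not_(a: bool) -> bool:
    return nand(a, a)

def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))

def xor(a: bool, b: bool) -> bool:
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

# The templates check out -- but finding them is where the actual work was.
for a in (False, True):
    for b in (False, True):
        assert not_(a) == (not a)
        assert and_(a, b) == (a and b)
        assert xor(a, b) == (a != b)
```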
I think my key crux is that in domains where there is a way to verify that the solution actually works, RL can scale to superhuman performance. Mathematics and programming are domains that are unusually easy to verify and to gather RL training data for, so with caveats it can become rather good at those specific domains/benchmarks, like a Millennium Prize eval. The important caveat is that I don’t believe this transfers very well to domains where verifying isn’t easy, like creative writing.
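A minimal sketch of what “verifiable” means operationally here (a hypothetical illustration, not any lab’s actual training setup): in programming, the reward can literally be “does the candidate pass the tests”, which is cheap and objective to compute at scale; nothing comparably crisp exists for creative writing.

```python
# Hypothetical illustration of a verifiable RL reward for code generation.
# A candidate solution is scored against held-out test cases; the reward
# is a crisp pass/fail signal that can be generated automatically at scale.
from typing import Callable, List, Tuple

def verifier_reward(candidate: Callable[[int], int],
                    test_cases: List[Tuple[int, int]]) -> float:
    """Return 1.0 iff the candidate matches every (input, expected_output) pair."""
    try:
        return 1.0 if all(candidate(x) == y for x, y in test_cases) else 0.0
    except Exception:
        return 0.0  # crashing candidates get no reward

# Example: a model-proposed solution to "square the input".
proposed = lambda x: x * x
print(verifier_reward(proposed, [(2, 4), (3, 9), (-1, 1)]))  # 1.0

# There is no analogous cheap, objective check for "write a moving short story".
```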
I’m bearish on that. I expect GPT-4 to GPT-5 to be palpably less of a jump than GPT-3 to GPT-4, same way GPT-3 to GPT-4 was less of a jump than GPT-2 to GPT-3. I’m sure it’d show lower loss, and saturate some more benchmarks, and perhaps an o-series model based on it clears FrontierMath, and perhaps programmers and mathematicians would be able to use it in an ever-greater number of cases...
I was talking about the 1 GW systems that would be developed in late 2026-early 2027, not GPT-5.
in domains where there is a way to verify that the solution actually works, RL can scale to superhuman performance
Sure, the theory on that is solid. But how efficiently does it scale off-distribution, in practice?
The inference-time scaling laws, much like the pretraining scaling laws, are ultimately based on test sets whose entries are “shallow” (in the previously discussed sense). They don’t tell us much about how well the technique scales with the “conceptual depth” of a problem.
o3 took a million dollars in inference-time compute, and unknown amounts in training-time compute, just to solve the “easy” subset of the FrontierMath benchmark (whose problems likely take human experts single-digit hours, maybe <1 hour for particularly skilled ones). How much would be needed to beat the “hard” subset of FrontierMath? How much more still would be needed for problems that take individual researchers days; or problems that take entire math departments months; or problems that take entire fields decades?
It’s possible that the “synthetic data flywheel” works so well that the amount of human-researcher-hour-equivalents per unit of compute scales, say, exponentially with some aspect of o-series’ training, and so o6 in 2027 solves the Riemann Hypothesis.
Or it scales not that well, and o6 can barely clear real-life equivalents of hard FrontierMath problems. Perhaps instead the training costs (generating all the CoT trees on which RL training is then done) scale exponentially, while researcher-hour-equivalents per unit of compute scale only linearly.
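To make the two regimes concrete, here’s a toy back-of-the-envelope model; all numbers are illustrative assumptions of mine, not estimates from this thread, and “researcher-hours of depth” is a stand-in for the vague quantity being argued about.

```python
# Toy model of the two scaling regimes above (all numbers are assumptions).
# Scenario A: the "researcher-hours of problem depth" a model can handle
#   multiplies with each o-series generation at roughly fixed cost.
# Scenario B: depth handled grows only linearly per generation, while the
#   RL training cost (generating the CoT trees) compounds by a constant factor.

BASE_DEPTH_HOURS = 5.0   # assumed: o3 clears problems worth single-digit researcher-hours
BASE_TRAIN_COST = 1.0    # arbitrary cost unit for o3's RL training run

def scenario_a(generation: int, depth_multiplier: float = 10.0) -> float:
    """Depth (researcher-hours) reachable if capability compounds per generation."""
    return BASE_DEPTH_HOURS * depth_multiplier ** generation

def scenario_b(generation: int, depth_step: float = 5.0,
               cost_multiplier: float = 10.0):
    """(depth reachable, relative training cost) if depth grows linearly but cost compounds."""
    depth = BASE_DEPTH_HOURS + depth_step * generation
    cost = BASE_TRAIN_COST * cost_multiplier ** generation
    return depth, cost

for gen in range(1, 4):  # o4, o5, o6 under these made-up assumptions
    depth_a = scenario_a(gen)
    depth_b, cost_b = scenario_b(gen)
    print(f"o{3 + gen}: A reaches ~{depth_a:,.0f} researcher-hours; "
          f"B reaches ~{depth_b:.0f} researcher-hours at {cost_b:,.0f}x the training cost")
```

Under assumptions like A, decades-scale problems come into reach within a few generations; under assumptions like B, they don’t, and the training bill explodes first.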
It doesn’t seem to me that we know which one it is yet. Do we?
I don’t think we know yet whether it will succeed in practice, or whether its training costs make it infeasible.