If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 20 years of effort can still succeed in 40 years. To the extent o1-like post-training enables something like System 2 reasoning, humans seem like a reasonable anchor for such plateaus. Larger LLMs generate about 100 output tokens per second (counting speculative decoding; processing of input tokens parallelizes). A human thinking for, let’s say, 8 hours a day at 1 token per second is 300 times slower.
Thus models that can solve a difficult problem at all will probably be able to do so faster than humans (in physical time). For a model that’s AGI, this likely translates into acceleration of AI research, making it possible to think useful thoughts even faster.
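A back-of-the-envelope sketch of that arithmetic, using only the figures assumed above (100 output tokens/s for the model, 1 token/s for 8 hours a day for the human; the variable names are just illustrative):

```python
# Rough speed comparison under the assumptions stated above.
model_tokens_per_sec = 100         # larger LLM, counting speculative decoding
human_tokens_per_sec = 1           # assumed rate of human "thinking tokens"
human_duty_cycle = 8 / 24          # thinking 8 hours out of every 24

human_effective_rate = human_tokens_per_sec * human_duty_cycle  # ~0.33 tokens/s
speedup = model_tokens_per_sec / human_effective_rate           # ~300

print(f"model is ~{speedup:.0f}x faster in physical time")      # ~300x
```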
If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 20 years of effort can still succeed in 40 years
Since the scaling is logarithmic, your example seems to be a strawman.
The real claim being debated is something more like:
“If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 100 months of effort can still succeed in 10,000 months”
And this formulation doesn’t seem obviously true.
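To make the log-scale point concrete, a rough sketch of how much extra budget each formulation grants, in orders of magnitude (using only the month and year figures quoted above):

```python
import math

def extra_oom(baseline, extended):
    """Orders of magnitude of extra effort granted beyond the failed baseline."""
    return math.log10(extended / baseline)

# Original example: fails after 20 years, then given 40 years.
print(f"{extra_oom(20 * 12, 40 * 12):.2f} OOM")   # 0.30 OOM, i.e. a factor of 2

# Reformulation: fails after 100 months, then given 10,000 months.
print(f"{extra_oom(100, 10_000):.2f} OOM")        # 2.00 OOM, i.e. a factor of 100
```

The original example only doubles the failed budget, while the reformulation extends it by a factor of 100.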
What I mean[1] is that it seems unlikely relative to what the scale implies: the graph on the log scale levels off before it gets there. This claim depends on the existence of a reference human who solves the problem in 1 month; there are some hard problems that take 30 years, but those aren’t relevant to the claim, since it’s about the range of useful slowdowns relative to human effort. The 1-month human remains human on the other side of the analogy, so doesn’t get impossible levels of starting knowledge; instead it’s the 20-year-failing human who becomes a 200-million-token-failing AI that fails despite a knowledge advantage.
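The 200-million-token figure is roughly what the earlier assumptions (1 token/s, 8 hours a day) give over 20 years; a quick check:

```python
# Approximate "thinking tokens" in 20 years of human effort, under the
# 1 token/s, 8 hours/day assumptions from the top-level comment.
tokens_per_day = 1 * 8 * 3600            # 28,800
tokens_per_year = tokens_per_day * 365   # ~10.5 million
tokens_20_years = tokens_per_year * 20   # ~210 million
print(f"{tokens_20_years:,} tokens")     # 210,240,000 tokens
```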
“If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 100 months of effort can still succeed in 10,000 months”
That is another implied claim, though it’s not actually observable as evidence, and it requires the 10,000 months to pass without advancements in relevant externally generated science (which is easier to imagine for 20 years with a sufficiently obscure problem). Progress like that is possible for sufficiently capable humans, but then I think there won’t be an even more capable human who solves it in 1 month. The relevant AIs are less capable than humans, so to the extent the analogy holds, they similarly won’t be able to make productive use of much longer exploration that is essentially serial.
[1] I considered this issue when writing the comment, but the range itself couldn’t be fixed, since both the decades-long failure and the month-long deliberation seem important, and then there is the human lifespan. My impression is that adding non-concrete details to the kind of top-level comment I’m capable of writing makes it weaker. But the specific argument for not putting in this detail was that this is a legibly implausible kind of mistake for me to make, and such arguments feed the norm of others not pointing out mistakes, so on reflection I don’t endorse this decision. Perhaps I should use footnotes more.
If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 20 years of effort can still succeed in 40 years
AI researchers have found that it is possible to trade inference compute for training compute across a wide variety of domains, including image generation, robotic control, game playing, computer programming, and solving math problems.

I suspect that your intuition about human beings is misled because in humans “stick-to-it-ness” and “intelligence” (g-factor) are strongly positively correlated. That is, in almost all cases of human genius, the best-of-the-best both have very high IQ and have spent a long time thinking about the problem they are interested in. In fact, inference compute is probably more important among human geniuses, since it is unlikely that (in terms of raw flops) even the smartest human is as much as 2x above the average (since human brains are all roughly the same size).
The human reasoning I’m comparing with also uses long reasoning traces, so unlocking the capability is part of the premise (many kinds of test-time compute parallelize, but not in this case, so the analogy is narrower than test-time compute in general). The question is how much you can get from reasoning traces 3 orders of magnitude longer still, beyond the first additional 3 orders of magnitude, while thinking at a quality below that of the reference human. Current o1-like post-training doesn’t yet promise that scaling goes that far (such traces won’t even fit in a context window, and who knows whether the scaling continues once workarounds for that are in place).
Human experience suggests to me that in humans scaling doesn’t go that far either. When a problem can be effectively reduced to simpler problems, then it wasn’t as difficult after all. And so the ratchet of science advances at a linear rather than logarithmic speed, within the bounds of human-feasible difficulty. The 300x of excess speed is a lot to overcome when the slowdown comes from reasoning traces orders of magnitude longer than anything feasible in human experience, for a single problem that resists more modular analysis.
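As a rough illustration of that last point, a sketch under the assumptions already in play (the model is about 300x faster in physical time, and the reference human solves the problem with a 1-month reasoning trace):

```python
# Wall-clock time for a model that is 300x faster than the human but needs a
# reasoning trace 10**k times longer than the human's 1-month trace.
# Illustrative only; assumes the extra reasoning is essentially serial.
speedup = 300
human_months = 1

for k in range(0, 6):
    model_months = human_months * (10 ** k) / speedup
    print(f"trace 10^{k}x longer -> ~{model_months:.3g} months of physical time")
```

Since log10(300) ≈ 2.5, roughly 2.5 extra orders of magnitude of trace length use up the whole speed advantage, and beyond that the model is slower than the reference human in physical time.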
Human experience suggests to me that in humans scaling doesn’t go that far either.
For biological reasons, humans do not think about problems for thousands of years. A human who gives a problem a good 2-hour think is less than 3 orders of magnitude away from a human who spends their entire career working on a single problem.