Humans without scaffolding can reliably execute only a small number of sequential reasoning steps without mistakes. That’s why thinking aids like paper, whiteboards, and other people to bounce ideas off (and keep the cache fresh) are so useful.
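One rough way to see why (a toy model added here for illustration, assuming each step fails independently, which is not something claimed above): if each unaided step goes wrong with probability $p$, the chance that a chain of $N$ steps comes out entirely mistake-free decays exponentially:

$$\Pr[\text{all } N \text{ steps correct}] = (1-p)^N, \qquad \text{e.g. } p = 0.02,\; N = 100 \;\Rightarrow\; 0.98^{100} \approx 0.13.$$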
I think OP is using “sequential” in an expansive sense that also includes e.g. “First I learned addition, then I learned multiplication (which relies on already understanding addition), then I learned the distributive law (which relies on already understanding both addition and multiplication), then I learned the concept of modular arithmetic (which relies on …) etc. etc.” (part of what OP calls “C”). I personally wouldn’t use the word ‘sequential’ for that—I prefer a more vertical metaphor like ‘things building upon other things’—but that’s a matter of taste I guess. Anyway, whatever we want to call it, humans can reliably do a great many steps, although that process unfolds over a long period of time.
…And not just smart humans. Just getting around in the world, using tools, etc., requires giant towers of concepts relying on other previously-learned concepts.
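To spell out the kind of dependency being described (a worked illustration, not something from the comments themselves), each rung of that tower is defined in terms of the rungs below it:

$$a \times b = \underbrace{a + a + \cdots + a}_{b\ \text{times}}, \qquad a \times (b + c) = a \times b + a \times c, \qquad a \equiv b \pmod{n} \iff n \mid (a - b),$$

so multiplication presupposes addition, the distributive law presupposes both, and modular arithmetic presupposes the whole stack.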
Obviously LLMs can deal with addition and multiplication and modular arithmetic etc. But I would argue that this tower of concepts building on other concepts was built by humans, and then handed to the LLM on a silver platter. I join OP in being skeptical that LLMs (including o3 etc.) could have built that tower themselves from scratch, the way humans did historically. And I for one don’t expect them to be able to do that thing until an AI paradigm shift happens.
[...] I personally wouldn’t use the word ‘sequential’ for that—I prefer a more vertical metaphor like ‘things building upon other things’—but that’s a matter of taste I guess. Anyway, whatever we want to call it, humans can reliably do a great many steps, although that process unfolds over a long period of time.
…And not just smart humans. Just getting around in the world, using tools, etc., requires giant towers of concepts relying on other previously-learned concepts.
As a clarification for anyone wondering why I didn’t use a framing more like this in the post, it’s because I think these types of reasoning (horizontal and vertical/A and C) are related in an important way, even though I agree that C might be qualitatively harder than A (hence §3.1). Or to put it differently, if one extreme position is “we can look entirely at A to extrapolate LLM performance into the future” and the other is “A and C are so different that progress on A is basically uninteresting”, then my view is somewhere near the middle.
This is true, but I don’t think it matters much for eventual performance. If someone thinks about a problem for a month, the number of times they went wrong on individual reasoning steps along the way barely influences the eventual output; at most they take a little longer. Performance is largely insensitive to per-step errors as long as the error-correcting mechanism is reliable.
I think this is actually a reason why most benchmarks are misleading (humans make mistakes there, and those mistakes influence the rating).
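To make the error-correction point concrete, here is a minimal toy simulation (an added sketch with made-up step counts and error rates, not anything described in the comments): chaining a hundred steps with no checking almost always fails, while a reliable check-and-redo loop keeps the final output correct at only a modest cost in extra steps.

```python
import random

# Toy model (an illustrative assumption, not anything stated in the thread):
# a "task" is a chain of n_steps reasoning steps. Each step independently goes
# wrong with probability p_error. Without checking, a single wrong step ruins
# the final answer. With a reliable checker, a wrong step is detected and
# redone, costing extra time but not correctness.

def run_task(n_steps: int, p_error: float, check: bool) -> tuple[bool, int]:
    """Return (task_succeeded, total_step_attempts)."""
    attempts = 0
    for _ in range(n_steps):
        while True:
            attempts += 1
            if random.random() >= p_error:
                break                   # step done correctly
            if not check:
                return False, attempts  # undetected error propagates to the output
            # checker caught the error; redo this step

    return True, attempts

def estimate(n_steps: int = 100, p_error: float = 0.05,
             check: bool = False, trials: int = 10_000) -> tuple[float, float]:
    """Estimate (success rate, mean step attempts) over many simulated tasks."""
    runs = [run_task(n_steps, p_error, check) for _ in range(trials)]
    success_rate = sum(ok for ok, _ in runs) / trials
    mean_attempts = sum(a for _, a in runs) / trials
    return success_rate, mean_attempts

if __name__ == "__main__":
    # No checking: success is roughly 0.95**100, i.e. well under 1%.
    print("without error correction:", estimate(check=False))
    # With checking: essentially 100% success, only ~5% more step attempts.
    print("with error correction:   ", estimate(check=True))
```

The numbers here are arbitrary; the point is only that final accuracy tracks the reliability of the checker rather than the raw per-step error rate, while the per-step error rate mostly shows up as extra time.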