Nice update. My probabilities on all of those are similar.
Here are some related thoughts, dashed off.
I spent a bunch of time looking at experimental results in cognitive psychology. Based on those, you’d actually question whether humans can do general reasoning.
Humans clearly can do general reasoning. But it’s not easy for us. The Wason selection task (the four-card task) is one remarkable demonstration of just how dumb humans are in some situations, but there are lots of similar ones.
Edit: Here’s the task; you don’t need to understand it to follow the rest of what I wrote. You probably already know it, but it’s worth trying if you don’t. If you do try it, force yourself to guess quickly, giving your first answer (System 1) before thinking it through (System 2).
There are four cards on the table, each with a number on one side and a letter on the other. Your task is to determine whether the following rule holds for the deck those cards were drawn from. The rule is: any card with an odd number on it has a vowel on the other side. The question is: which cards should you flip to determine whether that rule is being followed?
You see the following face-up on four cards:
| 7 | | E | | 4 | | G |
Quick! Which do you flip to test the rule?
Now, think it through.
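If you want to check your answer afterwards, here’s a minimal sketch of the underlying logic in Python (purely illustrative; the helper function names are made up). The key observation is that a card only needs flipping if its hidden side could falsify the rule.

```python
# Illustrative check for the selection task above.
# Rule: any card with an odd number on it has a vowel on the other side.
# Each card has a number on one side and a letter on the other, so a card
# only needs flipping if its hidden side could reveal a violation.

def is_odd_number(face: str) -> bool:
    return face.isdigit() and int(face) % 2 == 1

def is_consonant(face: str) -> bool:
    return face.isalpha() and face.upper() not in "AEIOU"

def must_flip(face: str) -> bool:
    if is_odd_number(face):   # hidden letter might be a consonant -> violation
        return True
    if is_consonant(face):    # hidden number might be odd -> violation
        return True
    return False              # vowels and even numbers can't break the rule

print([face for face in ["7", "E", "4", "G"] if must_flip(face)])  # ['7', 'G']
```

(A tempting quick answer is 7 and E, but the rule says nothing about what has to be behind a vowel, so flipping E can’t falsify it.)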
Those experiments were usually studying “cognitive biases”, so the experimenters had set out to find tasks that would be “gotchas” for the subjects. That’s pretty similar to the tasks current LLMs tend to fail on.
And those experiments were usually performed either on undergraduate students who were participating for “class credit”, or on Mechanical Turk for a few bucks. Almost none of them rewarded right answers well enough to make anyone care to spend time thinking about the task.
Both are largely testing “System 1” thinking—what the human brain produces in a “single forward pass”, in Transformer terms, or as a habitual response, in animal neuroscience terms. Subjectively, it’s just saying the first thing that comes to mind.
o1 performs much better because it’s doing what we’d call System 2 processing in humans. It’s reaching answers through multi-step cognition. Subjectively, we’d say we “thought about it”.
I originally thought that scaffolding LLMs would provide good System 2 performance by supplying explicit cues for stepping through a general reasoning process. I now think it doesn’t work easily, because the training on written language doesn’t have enough examples of people explicitly stating their cognitive steps in applying System 2 reasoning. Even in textbooks, too much is left implicit, since everyone can guess the results.
o1 seems to have been produced by finding a way to make a synthetic dataset of logical steps, captured in language, similar to the ones we use to “think through” problems to produce System 2 processing.
So here’s my guess on whether LLMs reach general reasoning by pure scaling: it doesn’t matter. What matters is whether they can reach it when they’re applied in a sequential, “reasoning” approach that we call System 2 in humans. That could be achieved by training, as in o1, or scaffolding, or more likely, a combination of both.
It really looks to me like that will reach human-level general reasoning. Of course I’m not sure. The big question is whether some other approach will surpass it first.
The other thing humans do to look way smarter than our “base reasoning” is to just memorize any of our conclusions that seem important. The next time someone shows us the Wason problem, we can show off looking like we can reason quickly and correctly—even if it took us ten tries and googling it the first time. Scaffolded language models haven’t used memory systems that are good enough to store conclusions in context and retrieve them quickly. But that’s purely a technical problem.
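As a purely hypothetical sketch of the kind of conclusion memory I mean (the names here are invented; nothing like this is taken from any particular system): do the expensive System 2 work once, store the result, and retrieve it cheaply when a recognizably similar problem comes up again.

```python
# Hypothetical sketch of a conclusion store for a scaffolded model.
# All names (ConclusionStore, remember, recall) are invented for illustration;
# a real system would presumably use embeddings or a proper retriever.

class ConclusionStore:
    def __init__(self) -> None:
        self._memory: dict[str, str] = {}

    @staticmethod
    def _key(problem: str) -> str:
        # Crude normalization: lowercase and drop punctuation.
        return "".join(c for c in problem.lower() if c.isalnum() or c.isspace()).strip()

    def remember(self, problem: str, conclusion: str) -> None:
        self._memory[self._key(problem)] = conclusion

    def recall(self, problem: str) -> str | None:
        return self._memory.get(self._key(problem))


store = ConclusionStore()
# The expensive System 2 work happens once...
store.remember("Wason task: odd number implies vowel", "flip the 7 and the G")
# ...and later the conclusion comes back as a single cheap lookup.
print(store.recall("wason task odd number implies vowel"))  # flip the 7 and the G
```

Real retrieval is obviously harder than an exact-key lookup, but that’s the sense in which I mean it’s purely a technical problem.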
I now think it doesn’t work easily, because the training on written language doesn’t have enough examples of people explicitly stating their cognitive steps in applying System 2 reasoning.
The cognitive steps are still part of the hidden structure that generated the data. That GPT-4-level models are unable to capture them is not necessarily evidence that it’s very hard. They’ve only just breached the reading comprehension threshold and started to reliably understand most nuanced meaning given directly in the text.
It’s only in the second half of 2024 that there is enough compute to start experimenting with scale significantly beyond the GPT-4 level (with possible recent results still hidden within frontier labs). Before that, there was no opportunity to see whether something else starts appearing just past GPT-4 scale, so the absence of such evidence isn’t yet evidence of absence, i.e. evidence that additional, currently-absent capabilities aren’t within easy reach. It’s been two years at about the same scale of base models, but that isn’t evidence that additional scale stops helping in crucial ways, since no experiments at significantly greater scale have been run in those two years.
I totally agree. Natural language datasets do have the right information embedded in them; it’s just obscured by a lot of other stuff. Compute alone might be enough to bring it out.
Part of my original hypothesis was that even a small improvement in the base model might be enough to make scaffolded System 2-type thinking very effective. It’s hard to guess when a system will get past the threshold where more thinking reliably works better, as it does for humans (with diminishing returns). That could come from a small improvement in the scaffolding, or a small improvement in memory systems, or even from better feedback from outside sources (e.g., using web searches and better distinguishing good from bad information).
All of those factors are critical in human thinking, and our abilities are clearly a nonlinear product of separate cognitive capacities. That’s why I expect improvements in any or all of those dimensions to eventually lead to human-plus fluid intelligence. And since efforts are underway on each of those dimensions, I’d guess we see that level sooner rather than later. Two years is my median guess for human-level reasoning on most problems, maybe all. But we might still not have online learning good enough that, to take a relevant example, the system could be trained on an arbitrary job and then do it competently. Fortunately, I expect it to scale past human level at a relatively slow pace from there, giving us a few more years to get our shit together once we’re staring roughly human-equivalent agents in the face and so start to take the possibilities seriously.
That all seems pretty right to me. It continues to be difficult to fully define ‘general reasoning’, and my mental model of it continues to evolve, but I think of ‘system 2 reasoning’ as at least a partial synonym.
Humans clearly can do general reasoning. But it’s not easy for us.
Agreed; not only are we very limited at it, but we often aren’t doing it at all.
I agree that it may be possible to achieve it with scaffolding even if LLMs don’t get there on their own; I’m just less certain of it.
In the medium-to-long term I’m inclined to taboo the word and talk about what I understand as its component parts, which I currently (off the top of my head) think of as something like:
The ability to do deduction, induction, and abduction.
The ability to do those in a careful, step-by-step way, with almost no errors (other than the errors that are inherent to induction and abduction on limited data).
The ability to do all of that in a domain-independent way.
The ability to use all of that to build a self-consistent internal model of the domain under consideration.
Don’t hold me to that, though, it’s still very much evolving. I may do a short-form post with just the above to invite critique.
I like trying to define general reasoning; I also don’t have a good definition. I think it’s tricky.
The ability to do deduction, induction, and abduction.
I think you’ve got to define how well it does each of these. As you noted in the comment about that very difficult math benchmark, saying a system can do general reasoning doesn’t mean it does so infinitely well.
The ability to do those in a careful, step by step way, with almost no errors (other than the errors that are inherent to induction and abduction on limited data).
I don’t know about this one. Humans seem to make a very large number of errors, but muddle through by recognizing at above-chance levels when they’re more likely to be correct—then building on that occasional success. So I think there are two routes to useful general-purpose reasoning—doing it well, or being able to judge success at above-chance rates and then remember it for future use one way or another.
The ability to do all of that in a domain-independent way.
The ability to use all of that to build a self-consistent internal model of the domain under consideration.
Here again, I think we shouldn’t overestimate how self-consistent or complete a model humans use when they make progress on difficult problems. It’s consistent and complete enough, but probably far from perfect.
Yeah, very fair point, those are at least in part defining a scale rather than a threshold (especially the error-free and consistent-model criteria).