My impression (which isn’t based on extensive knowledge, so I’m happy to be corrected) is that the models have got better at lots of individual tasks but the shape of their behaviour when faced with a task that’s a bit too hard for them hasn’t changed much: they offer an answer some part of which is nonsense; you query this bit; they say “I’m sorry, I was wrong” and offer a new answer some different part of which is nonsense; you query this bit; they say “I’m sorry, I was wrong” and offer a new answer some different part of which is nonsense; rinse and repeat.
So far, that pattern doesn’t seem to have changed much as the models have got better. You need to ask harder questions to make it happen, because they’ve got better at the various tasks, but once the questions get hard enough that they don’t really understand, back comes the “I’m sorry, I was wrong” cycle pretty much the same as it ever was.
That’s what something being impossible to explain looks like: the whack-a-mole pattern of correcting one problem only to get another, where the process never converges on correct understanding. As models improve, things that were borderline possible to explain start working without any need for explanation.
For long-horizon tasks, the things that would need to become possible to explain are general reasoning skills (as in How to Solve It, or what it means for something to be an actual proof). The whack-a-mole level of failure would need to go away on questions about the validity of reasoning steps or the appropriateness of the choice of the next step in a plan. The analogy suggests that first it would become possible to explain and discuss these issues at the level of the general skills themselves, rather than of the object-level issue the skills are being applied to. And then another step of scaling would enable a model to do a reasonable job of wielding such skills on its own.
There is an ambiguity here between whack-a-mole on an object-level question and, for example, whack-a-mole on explaining to the chatbot the whack-a-mole pattern itself. Even if the pattern remains the same as the feasible difficulty of object-level questions increases for better models, at some point the pattern itself can become such an object-level question, one that’s no longer impossible to explain.
I’m suggesting that the fact that things the model can’t do produce this sort of whack-a-mole behaviour, and that the shape of that behaviour hasn’t really changed as the models have grown better at individual tasks, may indicate something fundamental that’s missing from all models in this class, something that might not go away until some new fundamental insight comes along: more “steps of scaling” might not do the trick.
Of course it might not matter, if the models become able to do more and more difficult things until they can do everything humans can do, in which case we might not be able to tell whether the whack-a-mole failure mode is still there. My highly unreliable intuition says that the whack-a-mole failure mode is related to the planning and “general reasoning” lacunae you mention, and that those might turn out also to be things that models of this kind don’t get good at just by being scaled further.
But I’m aware that people saying “these models will never be able to do X” tend to find themselves a little embarrassed when two weeks later someone finds a way to get the models to do X. :-) And, for the avoidance of doubt, I am not saying anything even slightly like “mere computers will never be truly able to think”; only that there seems to be a hole in what the class of models that have so far proved most capable can be taught to do, and that we may need new ideas rather than just more “steps of scaling” to fill that hole.
My point was that whack-a-mole behavior is both a thing that the models are doing and an object-level idea that models might be able to understand to a certain extent, an idea playing the same role as a Fibonacci quine (except that Fibonacci quines are less important: they don’t come up in every third request to a model). As a phenomenon, whack-a-mole or a Fibonacci quine is something we can try to explain to a model. And there are three stages of understanding: inability to hold the idea in one’s mind at all, ability to hold it after extensive in-context tutoring, and ability to manipulate it without any need for tutoring. Discussing something that should work without a need for discussing it (like avoidance of listless whack-a-mole) is a window into the representations a model has, and those representations are the same thing that’s needed for it to work without a need for discussing it.
At the stage of complete incomprehension, a Fibonacci quine looks like nonsense that remains nonsense after each correction, even if it becomes superficially better in the one particular respect that the last correction pointed to. This could go on for many generations of models without visible change.
Then at some point it does change, and we arrive at the stage of coached understanding, like with Claude 3 Opus, where asking for a Fibonacci quine results in code that uses an exponential-time procedure for computing the n-th Fibonacci number, uses backslashes liberally, and tries to cheat by opening files. But then you point out the issues and bugs, and after 15 rounds of back-and-forth it settles into something reasonable. Absolutely not worth it in practice, but it demonstrates that the model is borderline capable of working with the idea. And the immediately following generation of models has Claude 3.5 Sonnet, arriving at the stage of dawning fluency, where its response looks like this (though not yet very robustly).
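(For concreteness, since the exact task isn’t spelled out above, here is a minimal sketch under one plausible reading of “Fibonacci quine”: a Python program that prints the first ten Fibonacci numbers and then an exact copy of its own source. It uses the standard repr-substitution quine trick and an iterative, linear-time fib, with no file-reading cheats; treat it as an illustration of the kind of target being discussed, not as the actual program or response from the conversation.)

```python
# One reading of "Fibonacci quine" (an assumption): print the first ten Fibonacci numbers, then print this very source code.
# Quine trick: src holds the whole source as a template, and repr(src) gets substituted back into it.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(10)])
src = '# One reading of "Fibonacci quine" (an assumption): print the first ten Fibonacci numbers, then print this very source code.\n# Quine trick: src holds the whole source as a template, and repr(src) gets substituted back into it.\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n\nprint([fib(i) for i in range(10)])\nsrc = %r\nprint(src %% src)'
print(src % src)
```

Run as a script, it prints the list [0, 1, 1, 2, 3, 5, 8, 13, 21, 34] followed by its own text; the %r placeholder is replaced by repr(src), which reproduces the src line itself, and %% becomes a literal percent sign.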
With whack-a-mole, we are still only getting into the second stage: the current models are starting to become barely capable of noticing that they are falling for this pattern, and only if you point it out to them (as opposed to giving an “it doesn’t look like anything to me” impression even after you do point it out). They won’t be able to climb out of the pattern unless you give specific instructions for what to do instead of following it. Even then they still fail and need another reminder, and so on. Sometimes this helps with solving the original problem, but only rarely, and it’s never worth it if the goal was just to solve the problem.
Models can remain between the first and the second stage for many generations without visible change, which is what you point out in the case of whack-a-mole. But once we are solidly in the second stage for general problem-solving and planning skills, I expect the immediately following generation of models to start intermittently getting into the third stage, failing gracefully and spontaneously pulling their own train of thought sideways in constructive ways. That would mean that if you leave them running for millions of tokens, they might waste 95% of them on silly and repetitive trains of thought, but they would still eventually make much more progress than weaker models that couldn’t course-correct at all.
If it’s true that models are “starting to become barely capable of noticing that they are falling for this pattern”, then I agree it’s a good sign (assuming that we want the models to become capable of “general intelligence”, of course, which we might not). I hadn’t noticed any such change, but if you tell me you’ve seen it I’ll believe you, and accordingly reduce my level of belief that there’s a really fundamental hole here.
It’s necessary to point it out to the model to see whether it might be able to understand; it doesn’t visibly happen on its own, and it’s hard to judge how well the model understands what’s happening with its behavior unless you start discussing it in detail (to a different extent for different models). The process I’m following to learn about this is to start discussing the general reasoning skills that the model is failing at when it repeatedly can’t make progress on some object-level problem (instead of discussing details of the object-level problem itself). And then I observe how the model fails to understand and apply the general reasoning skills that I’m explaining.
I’d say the current best models are not yet at the stage where they can understand such issues well when I try to explain them, so I don’t expect the next generation to become autonomously agentic yet (with any post-training). But they keep getting slightly better at this, with the first glimpses of understanding appearing in the original GPT-4.
That seems reasonable.