My point was that whack-a-mole behavior is both a thing the models are doing and an object-level idea that models might be able to understand to a certain extent, an idea playing the same role as a Fibonacci quine (except Fibonacci quines are less important; they don’t come up in every third request to a model). As a phenomenon, whack-a-mole or the Fibonacci quine is something we can try to explain to a model. And there are three stages of understanding: inability to hold the idea in one’s mind at all, ability to hold it after extensive in-context tutoring, and ability to manipulate it without any need for tutoring. Discussing something that should work without needing to be discussed (like avoidance of listless whack-a-mole) is a window into the representations a model has, and those representations are the same thing that’s needed for it to work without needing to be discussed.
At the stage of complete incomprehension, a Fibonacci quine looks like nonsense that remains nonsense after each correction, even if it becomes superficially better in the one particular respect the last correction pointed to. This could go on for many generations of models without visible change.
Then at some point it does change, and we arrive at the stage of coached understanding, as with Claude 3 Opus, where asking for a Fibonacci quine results in code that computes the n-th Fibonacci number with an exponential-time procedure, uses backslashes liberally, and tries to cheat by opening files. But then you point out the issues and bugs, and after 15 rounds of back-and-forth it settles into something reasonable. Absolutely not worth it in practice, but it demonstrates that the model is borderline capable of working with the idea. And the immediately following generation of models includes Claude 3.5 Sonnet, arriving at the stage of dawning fluency, where its response looks like this (though not yet very robustly).
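(For concreteness: the thread doesn’t spell out the exact prompt or reproduce the responses it refers to, so the sketch below is not any model’s actual output. It is just a reference point for what “something reasonable” could look like under one plausible reading of the task, a program that prints its own source and then the first ten Fibonacci numbers, written here in Python with an assumed helper name fib, and without the exponential recursion, stray backslashes, or file-reading tricks mentioned above.)

```python
# Prints its own source, then the first 10 Fibonacci numbers.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

s = '# Prints its own source, then the first 10 Fibonacci numbers.\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n\ns = %r\nprint(s %% s)\nprint([fib(i) for i in range(10)])'
print(s % s)
print([fib(i) for i in range(10)])
```

The self-reproduction here is the standard %r trick: the string s holds a template of the whole file, and formatting it with its own repr() regenerates the source exactly, so nothing needs to be read from disk.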
With whack-a-mole, we are still only getting into the second stage: the current models are starting to become barely capable of noticing that they are falling for this pattern, and only if you point it out to them (as opposed to giving an “it doesn’t look like anything to me” impression even after you do point it out). They won’t be able to climb out of the pattern unless you give specific instructions for what to do instead of following it. They still fail and need another reminder, and so on. Sometimes this helps with solving the original problem, but only rarely, and it’s never worth the effort if the goal was just to solve the problem.
Models can remain between the first and the second stage for many generations without visible change, which is what you point out in the case of whack-a-mole. But once we are solidly in the second stage for general problem-solving and planning skills, I expect the immediately following generation of models to start intermittently getting into the third stage, failing gracefully and spontaneously pulling their own train of thought sideways in constructive ways. That would mean that if you leave them running for millions of tokens, they might waste 95% of them on silly and repetitive trains of thought, but they would still eventually make much more progress than weaker models that couldn’t course-correct at all.
If it’s true that models are “starting to become barely capable of noticing that they are falling for this pattern”, then I agree it’s a good sign (assuming that we want the models to become capable of “general intelligence”, of course, which we might not). I hadn’t noticed any such change, but if you tell me you’ve seen it, I’ll believe you and accordingly reduce my level of belief that there’s a really fundamental hole here.
It’s necessary to point it out to the model to see whether it might be able to understand; it doesn’t visibly happen on its own, and it’s hard to judge how well the model understands what’s happening with its behavior unless you start discussing it in detail (to a different extent for different models). The process I’m following to learn about this is to start discussing the general reasoning skills the model is failing at when it repeatedly can’t make progress on some object-level problem (instead of discussing details of the object-level problem itself). Then I observe how the model fails to understand and apply the general reasoning skills I’m explaining.
I’d say the current best models are not yet at the stage where they can understand such issues well when I try to explain them, so I don’t expect the next generation to become autonomously agentic yet (with any post-training). But they keep getting slightly better at this, with the first glimpses of understanding appearing in the original GPT-4.