It’s pretty good. I tried it on a few mathematical questions.
First of all, a version of the standard AIW problem from the recent “Alice in Wonderland” paper (roughly: “Alice has N brothers and M sisters; how many sisters does Alice’s brother have?”). It got this right (not very surprisingly, as other leading models also do, at least much of the time). Then a version of the “AIW+” problem, which is much more confusing. Its answer was wrong, but its method (which it explained) was pretty much OK, and I am not sure it was any wronger than I would be on average trying to answer that question in real time.
Then some more conceptual mathematical puzzles. I took them from recent videos on Michael Penn’s YouTube channel. (His videos are commonly about undergraduate or easyish-olympiad-style pure mathematics. They seem unlikely to be in Claude’s training data, though of course other things containing the same problems might be.)
One pretty straightforward one: how many distinct factorials can you find that all end in the same number of zeros? It wrote down the correct formula for the number of zeros, then started enumerating particular numbers and got some things wrong, tried to do pattern-spotting, and gave a hilariously wrong answer; when gently nudged, it corrected itself kinda-adequately and gave an almost-correct answer (which it corrected properly when nudged again) but I didn’t get much feeling of real understanding.
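(For reference: the number of trailing zeros of n! is floor(n/5) + floor(n/25) + floor(n/125) + ..., i.e. the number of factors of 5 in n!. A few lines of code are enough to do the enumeration; the sketch below is just an illustration of the bookkeeping involved, not a reconstruction of Claude’s or Penn’s working.)

```python
# Trailing zeros of n!: count the factors of 5 in 1*2*...*n (Legendre's formula).
def trailing_zeros(n):
    count, power = 0, 5
    while power <= n:
        count += n // power
        power *= 5
    return count

# Group consecutive n by the zero count of n!.  The count increases at every
# multiple of 5, so at most five consecutive factorials can share the same count.
from itertools import groupby
runs = [(zeros, len(list(group)))
        for zeros, group in groupby(range(0, 60), key=trailing_zeros)]
print(runs)  # [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5), (6, 5), ...]
```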
Another (an exercise from Knuth’s TAOCP; he rates its difficulty HM22, meaning it needs higher mathematics and should take you 25 minutes or so; it’s about the relationship between two functions whose Taylor series coefficients differ by a factor H(n), the n’th harmonic number) it solved straight off and quite neatly.
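(If I’m identifying the exercise correctly from that description, the relationship involved is presumably the classical one below; I haven’t checked it against TAOCP, but the identity itself is easy to verify term by term.)

```latex
% Candidate identity (inferred from the description above, not checked against TAOCP):
% if f(z) = \sum_{n \ge 0} a_n z^n, then the function with coefficients H_n a_n is
\sum_{n \ge 1} H_n a_n z^n \;=\; \int_0^1 \frac{f(z) - f(tz)}{1 - t}\, dt ,
% which follows term by term from
\int_0^1 \frac{z^n - (tz)^n}{1 - t}\, dt
  \;=\; z^n \int_0^1 \bigl(1 + t + \cdots + t^{n-1}\bigr)\, dt
  \;=\; H_n z^n .
```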
Another (find all functions with (f(x)-f(y))/(x-y) = f’((x+y)/2) for all distinct x,y) it initially “solved” with a solution with a completely invalid step. When I said I couldn’t follow that step, it gave a fairly neat solution that works if you assume f is real-analytic (has a Taylor series expansion everywhere). This is also the first thing that occurred to me when I thought about the problem. When asked for a solution that doesn’t make that assumption, it unfortunately gave another invalid solution, and when prodded about that it gave another invalid one. Further prompting, even giving it a pretty big hint in the direction of a nice neat solution (better than Penn’s :-)), didn’t manage to produce a genuinely correct solution.
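(For anyone curious, the real-analytic argument presumably runs along the following lines; this is a sketch of that style of solution, not a transcript of Claude’s.)

```latex
% Sketch of the real-analytic case (an illustration, not Claude's actual solution).
% Put x = m + h, y = m - h with h \ne 0; the equation becomes
\frac{f(m+h) - f(m-h)}{2h} \;=\; f'(m).
% Expanding the left-hand side as a power series in h (valid since f is analytic):
f'(m) \;+\; \frac{f'''(m)}{6} h^2 \;+\; \frac{f^{(5)}(m)}{120} h^4 \;+\; \cdots \;=\; f'(m)
% for all small h, so f'''(m) = 0 at every m, hence f(x) = a x^2 + b x + c.
% Every such quadratic does satisfy the equation, since
\frac{f(x) - f(y)}{x - y} \;=\; a(x + y) + b \;=\; f'\!\left(\tfrac{x+y}{2}\right).
```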
I rate it “not terribly good undergraduate at a good university”, I think, but—as with all these models to date—with tragically little “self-awareness”, in the sense that it’ll give a wrong answer, and you’ll poke it, and it’ll apologize effusively and give another wrong answer, and you can repeat this several times without making it change its approach or say “sorry, it seems I’m just not smart enough to solve this one” or anything.
On the one hand, the fact that we have AI systems that can do mathematics about as well as a not-very-good undergraduate (and quite a bit faster) is fantastically impressive. On the other hand, it really does feel as if something fairly fundamental is missing. If I were teaching an actual undergraduate whose answers were like Claude’s, I’d worry that there was something wrong with their brain that had somehow left them still kinda able to do mathematics. I wouldn’t bet heavily that just continuing down the current path won’t get us to “genuinely smart people really thinking hard with actual world models” levels of intelligence in the nearish future, but I think that’s still the way I’d bet.
(Of course a system that’s at the “not very good undergraduate” level in everything, which I’m guessing is roughly what this is, is substantially superhuman in some important respects. And I don’t intend to imply that it doesn’t matter whether Anthropic are lax about what they release just because the latest thing happens not to be smart enough to be particularly dangerous.)
A chatbot’s capability to understand something when extensively coached seems to indicate what the next generation will be able to do on its own, and eliciting this capability is probably less sensitive to details of post-training than seeing what the model can do zero-shot or with only oblique nudging. The quine puzzle I posted could only be explained to the strongest preceding models, which were unable to solve it on their own; to models weaker than that it can’t be explained at all.
So for long-horizon task capabilities, I’m placing some weight on checking if chatbots start understanding unusually patient and detailed in-context instruction on applying general planning or problem-solving skills to particular examples. They seem to be getting slightly better.
That seems reasonable.
My impression (which isn’t based on extensive knowledge, so I’m happy to be corrected) is that the models have got better at lots of individual tasks but the shape of their behaviour when faced with a task that’s a bit too hard for them hasn’t changed much: they offer an answer some part of which is nonsense; you query this bit; they say “I’m sorry, I was wrong” and offer a new answer some different part of which is nonsense; you query this bit; they say “I’m sorry, I was wrong” and offer a new answer some different part of which is nonsense; rinse and repeat.
So far, that pattern doesn’t seem to have changed much as the models have got better. You need to ask harder questions to make it happen, because they’ve got better at the various tasks, but once the questions get hard enough that they don’t really understand, back comes the “I’m sorry, I was wrong” cycle pretty much the same as it ever was.
That’s what something being impossible to explain looks like: the whack-a-mole pattern of correcting one problem only to see another appear, with the process never converging on correct understanding. As models improve, things that were borderline possible to explain start working without any need for explanation.
For long-horizon tasks, things that would need to be possible to explain are general reasoning skills (as in How to Solve It, or what it means for something to be an actual proof). The whack-a-mole level of failure would need to go away on questions of validity of reasoning steps or appropriateness of choice of the next step of a plan. The analogy suggests that first it would become possible to explain and discuss these issues, at the level of general skills themselves rather than of the object-level issue that the skills are being applied to. And then another step of scaling would enable a model to do a reasonable job of wielding such skills on its own.
There is an ambiguity here between whack-a-mole on an object-level question and, for example, whack-a-mole on explaining to the chatbot the whack-a-mole pattern itself. Even if the pattern remains the same while the feasible difficulty of the object-level questions increases for better models, at some point the pattern itself can become an object-level question of this kind, one that’s no longer impossible to explain.
I’m suggesting that the fact that things the model can’t do produce this sort of whack-a-mole behaviour, and that the shape of that behaviour hasn’t really changed as the models have grown better at individual tasks, may indicate something fundamental that’s missing from all models in this class, something that might not go away until some new fundamental insight comes along: more “steps of scaling” might not do the trick.
Of course it might not matter, if the models become able to do more and more difficult things until they can do everything humans can do, in which case we might not be able to tell whether the whack-a-mole failure mode is still there. My highly unreliable intuition says that the whack-a-mole failure mode is related to the planning and “general reasoning” lacunae you mention, and that those might turn out also to be things that models of this kind don’t get good at just by being scaled further.
But I’m aware that people saying “these models will never be able to do X” tend to find themselves a little embarrassed when two weeks later someone finds a way to get the models to do X. :-) And, for the avoidance of doubt, I am not saying anything even slightly like “mere computers will never be truly able to think”; only that it seems like there may be a hole in what the class of models that have so far proved most capable can be taught to do, and that we may need new ideas rather than just more “steps of scaling” to fill those holes.
My point was that whack-a-mole behavior is both a thing that the models are doing and an object-level idea that models might be able to understand to a certain extent, an idea playing the same role as a fibonacci quine (except that fibonacci quines are less important: they don’t come up in every third request to a model). As a phenomenon, whack-a-mole or a fibonacci quine is something we can try to explain to a model. And there are three stages of understanding: inability to hold the idea in one’s mind at all, ability to hold it after extensive in-context tutoring, and ability to manipulate it without any need for tutoring. Discussing something that should eventually work without needing discussion (like avoiding listless whack-a-mole) is a window into the representations a model has, and those same representations are what’s needed for it to work without the discussion.
At the stage of complete incomprehension, a fibonacci quine attempt looks like nonsense that remains nonsense after each correction, even if it becomes superficially better in the one particular respect that the last correction pointed to. This could go on for many generations of models without visible change.
Then at some point it does change, and we arrive at the stage of coached understanding, as with Claude 3 Opus, where asking for a fibonacci quine results in code that uses an exponential-time procedure for computing the n-th fibonacci number, uses backslashes liberally, and tries to cheat by opening files. But then you point out the issues and bugs, and after 15 rounds of back-and-forth it settles into something reasonable. Absolutely not worth it in practice, but it demonstrates that the model is borderline capable of working with the idea. And the immediately following generation of models has Claude 3.5 Sonnet, arriving at the stage of dawning fluency, where its response looks like this (though not yet very robustly).
With whack-a-mole, we are still only getting into the second stage: the current models are starting to become barely capable of noticing that they are falling into this pattern, and only if you point it out to them (as opposed to giving an “it doesn’t look like anything to me” impression even after you do point it out). They won’t be able to climb out of the pattern unless you give specific instructions for what to do instead of following it, and even then they still fail and need another reminder, and so on. Sometimes this helps with solving the original problem, but only rarely, and it’s never worth it if the goal was just to solve the problem.
Models can remain between the first and the second stage for many generations without visible change, which is what you point out in the case of whack-a-mole. But once we are solidly in the second stage for general problem-solving and planning skills, I expect the immediately following generation of models to start intermittently getting into the third stage, failing gracefully and spontaneously pulling their own train of thought sideways in constructive ways. That would mean that if you leave them running for millions of tokens, they might waste 95% of them on silly and repetitive trains of thought, but they would still eventually be making much more progress than weaker models that couldn’t course-correct at all.
If it’s true that models are “starting to become barely capable of noticing that they are falling for this pattern” then I agree it’s a good sign (assuming that we want the models to become capable of “general intelligence”, of course, which we might not). I hadn’t noticed any such change, but if you tell me you’ve seen it I’ll believe you and accordingly reduce my level of belief that there’s a really fundamental hole here.
It’s necessary to point it out to the model to see whether it might be able to understand: it doesn’t visibly happen on its own, and it’s hard to judge how well the model understands what’s happening with its behavior unless you start discussing it in detail (which is possible to a different extent for different models). The process I’m following to learn about this is to start discussing the general reasoning skills that the model is failing at when it repeatedly can’t make progress on some object-level problem (instead of discussing the details of the object-level problem itself), and then to observe how the model fails to understand and apply the general reasoning skills I’m explaining.
I’d say the current best models are not yet at the stage where they can understand such issues well when I try to explain them, so I don’t expect the next generation to become autonomously agentic yet (with any post-training). But they keep getting slightly better at this, with the first glimpses of understanding appearing in the original GPT-4.