Very cool project!
My impression so far was that transformer models do not learn search in chess and you are careful to only speak about lookahead. I would suggest that even that is not necessarily the case: I suspect the models learn to recognise multi-move patterns. I.e. they recognise positions that allow certain multi-move tactical strikes.
To tease search/lookahead and pattern recognition apart, I started creating a benchmark with positions that are solved by surprising and unintuitive moves, but I haven’t had any time to keep working on this idea, and it has been on hold for a couple of months.
I thought I’d spell this out a bit:
What is the difference between multi-move pattern recognition and lookahead/search?
Lookahead/search is a general algorithm, built on top of relatively easy-to-learn move prediction and position evaluation. In the case of lookahead, this general algorithm takes the move prediction and goes through the positions that arise when making the predicted moves, assessing each of these positions along the way.
Multi-move pattern recognition starts out as simple pattern recognition: The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.
Sometimes it will predict this move although the combination, the multi-move tactical strike, doesn’t quite work, but over time it will learn that Ng6 is unlikely when there is a black bishop on d2 ready to drop back to h6 or when there is a black knight on g2, guarding the square h4 where the rook would have to checkmate.
This is what I call multi-move pattern recognition: The network has to learn a lot of details and conditions to predict when the tactical strike (a multi-move pattern) works and when it doesn’t. In this case you would make the same observations that were described in this post: For example, if you ablated the square h4, you’d lose the information of whether this square will be available for the rook on the second move. It is important for the pattern recognition to know where pieces will have to go in the future.
But the crucial difference to lookahead or search is that this is not a general mechanism. Quite the contrary, it is the result of increasing specialisation on this particular type of position. If you removed a certain tactical pattern from the training data, the NN would be unable to find it.
It is exactly the ability of system 2 thinking to generalise much further than this that makes the question of whether transformers develop it so important.
I think the methods described in this post are even in principle unable to distinguish between multi-move pattern recognition and lookahead/search. Certain information is necessary to make the correct prediction in certain kinds of positions. The fact that the network generally makes the correct prediction in these types of positions already tells you that this information must be processed and made available by the network. The difference between lookahead and multi-move pattern recognition is not whether this information is there but how it got there.
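To make the distinction above concrete, here is a minimal sketch. It is not anything from the post or from Leela, and every name in it is hypothetical. `lookahead` is the "general algorithm" in the sense used above: it only needs a move predictor and a position evaluator, and it walks through the positions that arise. The lookup table is a deliberate caricature of pure pattern recognition. A tiny take-away game stands in for chess, purely as an assumption to keep the example runnable.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple, TypeVar

S = TypeVar("S")  # position / state
M = TypeVar("M")  # move


def lookahead(
    state: S,
    predict_moves: Callable[[S], Iterable[M]],
    apply_move: Callable[[S, M], S],
    evaluate: Callable[[S], float],
    depth: int,
) -> Tuple[float, Optional[M]]:
    """Follow the predicted moves up to `depth` plies and assess what is there.

    Values are from the point of view of the side to move (negamax convention).
    """
    moves = list(predict_moves(state))
    if depth == 0 or not moves:
        return evaluate(state), None
    best_value, best_move = float("-inf"), None
    for move in moves:
        child_value, _ = lookahead(
            apply_move(state, move), predict_moves, apply_move, evaluate, depth - 1
        )
        # A position that is good for the opponent is bad for us.
        if -child_value > best_value:
            best_value, best_move = -child_value, move
    return best_value, best_move


# Toy stand-in for chess (an assumption, chosen only to keep this runnable):
# players alternately take 1-3 stones; whoever takes the last stone wins.
def toy_moves(pile: int) -> list:
    return [k for k in (1, 2, 3) if k <= pile]


def toy_apply(pile: int, k: int) -> int:
    return pile - k


def toy_eval(pile: int) -> float:
    # Empty pile: the side to move has already lost. Otherwise: no opinion.
    return -1.0 if pile == 0 else 0.0


# Caricature of pure pattern recognition: positions seen in training -> move.
seen_patterns: Dict[int, int] = {1: 1, 2: 2, 3: 3}


def pattern_move(pile: int) -> Optional[int]:
    return seen_patterns.get(pile)  # None for any position never seen


table_answer = pattern_move(5)  # None: the table has nothing to say
value, move = lookahead(5, toy_moves, toy_apply, toy_eval, depth=5)  # (1.0, 1)
```

The point of the sketch is only that `lookahead` never needed to have seen the position with five stones, while the table did. A real network doing pattern recognition is of course far less brittle than a dictionary, so this shows the two ends of the contrast, not a model of either.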
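Thanks for the elaboration, these are good points. I think about the difference between what you call look-ahead vs pattern recognition on a more continuous spectrum. For example, you say:
“The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.”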
You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis and have a king on a8, the b8 square covered, etc. Even more generally, you could learn that with a king on h8, etc., the h7 pawn is “effectively pinned,” and so g6 isn’t actually protected—this might then generalize to capturing a piece on g6 with some piece other than a knight (thus not giving check). Continuing like this, I think you could basically fill the entire spectrum between very simple pattern recognition and very general algorithms.
From that perspective, I’d guess Leela sits somewhere in the middle of that spectrum. I agree it’s likely not implementing “a general algorithm, built on top of relatively easy-to-learn move prediction and position evaluation” in the broadest sense. On the other hand, I think some of our evidence points towards mechanisms that are used for “considering future moves” and that are shared between a broad range of board states (mainly the attention head results, more arguably the probe).
I think the spectrum you describe is between pattern recognition by literal memorisation and pattern recognition building on general circuits.
There are certainly general circuits that compute whether a certain square can be reached by a certain piece on a certain other square.
But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the “pattern recognition”-network to predict that Ng6 is not a feasible option.
The “lookahead” network, however, would go through these moves and assess that 2.Rh4 is not mate because of 2...Bh6. The lookahead algorithm would allow it to use general low-level circuits like “block mate” or “move bishop/queen on a diagonal” to generalise to unseen combinations of patterns.
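I still don’t see the crisp boundary you seem to be getting at between “pattern recognition building on general circuits” and what you call “look-ahead.” It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:
“But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the “pattern recognition”-network to predict that Ng6 is not a feasible option.”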
If the training data had an example of a rook checkmate on h4 being blocked by a bishop to h6, you could imagine many different possibilities:
This doesn’t generalize to a rook checkmate on h3 being blocked by a bishop (i.e. the network would get that change wrong if it hasn’t also explicitly seen it)
This generalizes to rook checkmates along the h-file, but doesn’t generalize to rook checkmates along other files
This generalizes to arbitrary rook checkmates
This also generalizes to bishop checkmates being blocked
This also generalizes to a rook trapping the opponent queen (instead of the king)
...
(Of course, this generalization question is likely related to the question of whether these different cases share “mechanisms.”)
At the extreme end of this spectrum, I imagine a policy whose performance only depends on some simple measure of “difficulty” (like branching factor/depth needed) and which internally relies purely on simple algorithms like tree search without complex heuristics. To me, this seems like an idealized limit point to this spectrum (and not something we’d expect to actually see; for example, humans don’t do this either). You might have something different/broader in mind for “look-ahead,” but when I think about broader versions of this, they just bleed into what seems like a continuous spectrum.
I don’t think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.
Here is one (thought) experiment to tease this apart: Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motif and erase it from the checkmate prediction part of the training set, but not from the move prediction part.
Now the model still knows which moves are the right ones to make, i.e. it would play the checkmate variation in a game. But would it still be able to predict the checkmate?
If it relies on pattern recognition, it wouldn’t: it has never seen this pattern connected to mate-in-x. But if it relies on lookahead, where it leverages the ability to predict the correct moves and then assesses the final position, then it would still be able to predict the mate.
The results of this experiment would also be on a spectrum from 0% to 100% of correct checkmate prediction for this tactical motif. But I think it would be fair to say that it hasn’t really learned lookahead for 0% or a very low percentage, and that’s what I would expect.
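The data side of this thought experiment could be sketched roughly as follows. This is only an illustration under assumptions: it presumes every training position already carries a motif tag (which is the hard part in practice), and the names `Example`, `mask_mate_labels` and `held_out_mate_accuracy` are made up for this sketch. The checkmate label is masked for the held-out motif while the move label stays, and the 0% to 100% number is then measured on labelled test positions of that motif.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Example:
    position: str                # any position encoding, e.g. a FEN string
    best_move: str               # move-prediction target, always kept
    forced_mate: Optional[bool]  # checkmate target; None = excluded from that loss
    motif: str                   # hypothetical motif tag, assumed to be available


def mask_mate_labels(examples: List[Example], held_out_motif: str) -> List[Example]:
    """Build the training set: erase the checkmate label for one motif only."""
    return [
        replace(ex, forced_mate=None) if ex.motif == held_out_motif else ex
        for ex in examples
    ]


def held_out_mate_accuracy(
    predict_mate: Callable[[str], bool],
    test_examples: List[Example],
    held_out_motif: str,
) -> float:
    """Fraction of held-out-motif test positions with a correct mate prediction."""
    targets = [
        ex for ex in test_examples
        if ex.motif == held_out_motif and ex.forced_mate is not None
    ]
    if not targets:
        raise ValueError("no labelled test positions for the held-out motif")
    hits = sum(predict_mate(ex.position) == ex.forced_mate for ex in targets)
    return hits / len(targets)
```

Here `predict_mate` stands for the trained model’s checkmate head. The test set should contain both mating and non-mating positions of the held-out motif, since otherwise a head that always answers “mate” would score 100% without any lookahead.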
Maybe I misunderstood you then, and to be clear, I agree that you don’t need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)
For your thought experiment, my prediction would depend on the specifics of what this “tactical motif” looks like. For a very narrow motif, I expect the checkmate predictor will just generalize correctly. For a broader motif (like all backrank mates), I’m much less sure. It still seems plausible that it would generalize if both predictors are just very simple heads on top of a shared network body. The less computational work is shared between the heads, the less likely generalization seems.
Note that 0% to 100% accuracy is not the main spectrum I’m thinking of (though I agree it’s also relevant). The main spectrum for me is the broadness of the motif (and in this case how much computation the heads share, but that’s more specific to this experiment).
Hmm, yeah, I think we are talking past each other.
Everything you describe is just pattern recognition to me. Lookahead or search does not depend on the broadness of the motif.
Lookahead, to me, is the ability to look ahead and see what is there. It allows very high certainty even for never-before-seen mating combinations.
If the line is forcing enough, it allows finding very deep combinations (which you will never ever find with pattern recognition, because the combinatorial explosion means that basically every deep combination has never been seen before).
In humans, it is clearly different from pattern recognition. Humans can see multi-move patterns at a glance. The example in the post I would play instantly in every blitz game. I would check the conditions of the pattern, but I wouldn’t have to “look ahead”.
Humans consider future moves even when intuitively assessing positions. “This should be winning, because I still have x, y and z in the position”. But actually calculating is clearly different because it is effortful. You have to force yourself to do it (or at least I usually have to). You manipulate the position sequentially in your mind and see what could happen. This allows you to see many things that you couldn’t predict from your past experience in similar positions.
I didn’t want to get hung up on whether there is a crisp boundary. Maybe you are right and you just keep generalising and generalising until there is a search algo in the limit. I very much doubt this is where the ability of humans to calculate ahead comes from. In transformers? Who knows.