Evidence against Learned Search in a Chess-Playing Neural Network

Introduction

There is a new paper and LessWrong post about “learned look-ahead in a chess-playing neural network”. This has long been a research interest of mine for reasons that are well-stated in the paper:

Can neural networks learn to use algorithms such as look-ahead or search internally? Or are they better thought of as vast collections of simple heuristics or memorized data? Answering this question might help us anticipate neural networks’ future capabilities and give us a better understanding of how they work internally.

and further:

Since we know how to hand-design chess engines, we know what reasoning to look for in chess-playing networks. Compared to frontier language models, this makes chess a good compromise between realism and practicality for investigating whether networks learn reasoning algorithms or rely purely on heuristics.

So the question is whether François Chollet is correct that transformers do “curve fitting”, i.e. memorisation with little generalisation, or whether they learn to “reason”. “Reasoning” is a fuzzy word, but in chess you can at least look for what human players call “calculation”, that is, the ability to execute moves solely in your mind in order to observe and evaluate the resulting position.

To me this is a crux as to whether large language models will scale to human capabilities without further algorithmic breakthroughs.

The paper’s authors, who include Erik Jenner and Stuart Russell, conclude that the policy network of Leela Chess Zero (a top engine and open-source replication of AlphaZero) does learn look-ahead.

Using interpretability techniques they “find that Leela internally represents future optimal moves and that these representations are crucial for its final output in certain board states.”

While the term “look-ahead” is fuzzy, the paper clearly intends to show that the Leela network implements an “algorithm” and a form of “reasoning”.

My interpretation of the presented evidence is different, as discussed in the comments of the original LessWrong post. I argue that all the evidence is completely consistent with Leela having learned to recognise multi-move patterns. Multi-move patterns are just complicated patterns that take into account that certain pieces will have to be able to move to certain squares in future moves for the pattern to hold.

The crucial difference to having learned an algorithm:

An algorithm can take different inputs and do its thing. That allows generalisation to unseen or at least unusual inputs. This means that less data is necessary for learning because the generalisation power is much higher.

Learning multi-move patterns on the other hand requires much more data because the network needs to see many versions of the pattern until it knows all specific details that have to hold.

Analysis setup

Unfortunately it is quite difficult to distinguish between these two cases. As I argued:

Certain information is necessary to make the correct prediction in certain kinds of positions. The fact that the network generally makes the correct prediction in these types of positions already tells you that this information must be processed and made available by the network. The difference between lookahead and multi-move pattern recognition is not whether this information is there but how it got there.

However, I propose an experiment that makes it clear that there is a difference.

Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motif and erase it from the checkmate prediction part of the training set, but not from the move prediction part.

Now the model still knows which moves are the right ones to make, i.e. it would play the checkmate variation in a game. But would it still be able to predict the checkmate?

If it relies on pattern recognition it wouldn’t: it has never seen this pattern connected to mate-in-x. But if it relies on look-ahead, leveraging its ability to predict the correct moves and then assessing the final position, it would still be able to predict the mate.

At the time I thought this was just a thought experiment to get my point across. But after looking at the code that was used for the analysis in the paper, I realised that something quite similar could be done with the Leela network.

The Leela network is not just a policy network, but also a value network. Like AlphaGo and co., it computes not just a ranking of moves but also an evaluation of the position in the form of win, draw and loss probabilities.

This allows us to analyse whether the Leela network “sees” the correct outcome when it predicts the correct move. If it picks the correct first move of a mating combination because it has seen the mate, then it should also predict the mate and therefore a high winning probability. If it guesses the first move based on pattern recognition it might be oblivious to the mate and predict only a moderate or even low probability of winning.
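Concretely, the per-position query can be done through the UCI interface with python-chess and an lc0 binary. This is a minimal sketch, assuming a hypothetical weight-file path and using a one-node search as an approximation of a single forward pass (raw policy and value heads, no tree search on top):

```python
import chess
import chess.engine

# Assumed: an lc0 binary on PATH and a network weight file (path is hypothetical).
LC0_CMD = ["lc0", "--weights=leela-network.pb.gz"]

def leela_move_and_win_prob(engine: chess.engine.SimpleEngine,
                            board: chess.Board) -> tuple[chess.Move, float]:
    """Leela's preferred move and win probability for the side to move.

    A one-node search is used so that the answer approximates a single forward
    pass of the network rather than any tree search built on top of it.
    """
    info = engine.analyse(board, chess.engine.Limit(nodes=1))
    best_move = info["pv"][0]
    wdl = info["wdl"].pov(board.turn)  # win/draw/loss from the mover's perspective
    win_prob = wdl.wins / (wdl.wins + wdl.draws + wdl.losses)
    return best_move, win_prob

with chess.engine.SimpleEngine.popen_uci(LC0_CMD) as engine:
    engine.configure({"UCI_ShowWDL": True})  # make lc0 report WDL numbers
    board = chess.Board("6k1/5ppp/8/8/8/2Q5/5PPP/6K1 w - - 0 1")  # toy position
    move, p_win = leela_move_and_win_prob(engine, board)
    print(move.uci(), round(p_win, 3))
```

Given a list of puzzle positions and their winning moves, “solved” then simply means that best_move matches the stored solution, and win_prob is what we compare across solved and unsolved puzzles.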

The Dataset

To conduct this analysis I scrape 193,704 chess problems from the website of the German Chess Composition Association “Schwalbe”. These are well-suited for this test because chess compositions are somewhat out of distribution for a chess-playing network, so a lack of generalisation should be more noticeable. They are usually designed to require “reasoning” and to be hard to guess.

However, the dataset requires extensive filtering to remove “fairy chess” problems with made-up rules, and the solutions have to be checked with Stockfish. This leaves 54,424 validated puzzles with normal rules. All of them are white to move and win, often with a mate in n moves.

One further complication is that a mate-in-n puzzle often features an overwhelming advantage for white, and the difficulty lies in finding the fastest win, something Leela was not trained to do. So I filter the puzzles down to 1,895 puzzles that have just one winning move; 1,274 of those are mate-in-n with n < 10.
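The unique-winning-move filter can be approximated with Stockfish and MultiPV. Here is a minimal sketch; the search depth and the centipawn threshold for “clearly winning” are just illustrative choices:

```python
import chess
import chess.engine

STOCKFISH_CMD = ["stockfish"]  # assumed to be available on PATH
WIN_THRESHOLD_CP = 300         # illustrative cutoff for "clearly winning" (centipawns)

def cp_for_white(pov_score: chess.engine.PovScore) -> int:
    """Centipawn score from White's perspective; mates map to very large values."""
    return pov_score.white().score(mate_score=100000)

def has_unique_winning_move(fen: str, depth: int = 20) -> bool:
    """Keep a puzzle only if Stockfish thinks exactly one move wins for White.

    MultiPV=2 returns the two best moves: the puzzle qualifies if the best move
    is clearly winning while the second-best move is not.
    """
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci(STOCKFISH_CMD) as engine:
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    best_wins = cp_for_white(infos[0]["score"]) >= WIN_THRESHOLD_CP
    second_wins = (len(infos) > 1
                   and cp_for_white(infos[1]["score"]) >= WIN_THRESHOLD_CP)
    return best_wins and not second_wins
```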

Analysis results

The Leela network is pretty amazing. If we accept a correctly predicted first move as “solution”, it solves a bit more than 50% of the puzzles. In the following we try to dig into whether this is due to amazing “intuition”, i.e. pattern-recognition-based guesses, or due to a look-ahead algorithm.

Accuracy by depth

Humans solve these puzzles by reasoning and calculation. They think ahead until they find the mate. As a consequence, mating puzzles get harder as the mating combination gets deeper. This is of course also true for search-based engines. Mate-in-2 is almost always solvable for me, because it is close to being brute-forceable. Mate-in-3 is already often much harder. Longer mates can become arbitrarily hard, though of course there are many factors that make puzzles easy or hard, and depth is just one of them.

If Leela’s abilities were substantially founded on the ability to look ahead and find the mate, we would expect a similar pattern: Deeper mates would be harder to solve than shallower mates.

This is not what we find. Overall, the deeper mates are more often solved by the Leela network. This makes sense from a pattern-recognition-based move prediction perspective, because shorter mates probably have more surprising initial moves: the composer doesn’t have as many later moves to cram in aesthetic value.

Winning probabilities by depth

Similarly, humans tend to get less confident in their solution the more moves it entails. Obviously, even if the line is completely forced, a mate-in-8 gives twice as many opportunities to overlook something as a mate-in-4. Additionally, the farther the imagined board state is from the actual position, the more likely it is that mental errors creep in, like captured or moved pieces reappearing on their original squares.

Again, this is not what we find for the Leela network. For the solved mates the predicted winning probability hovers around 40% independent of the depth of the mate.

For unsolved mates (remember, these are filtered to have only one winning move, which in this case Leela missed) the probability hovers around 30%.

This is consistent with Leela assessing a kind of dynamic potential by recognising many tactical motifs that might be strung together into an advantageous tactical strike, without actually working out the winning combination.
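Both depth breakdowns boil down to a simple aggregation over a per-puzzle results table; a minimal sketch, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical per-puzzle results table with columns:
#   mate_depth  (int)   depth n of the mate-in-n
#   solved      (bool)  Leela's top move equals the unique winning move
#   win_prob    (float) Leela's win probability for the puzzle position
results = pd.read_csv("leela_puzzle_results.csv")

solved = results[results["solved"]]
unsolved = results[~results["solved"]]

summary = pd.DataFrame({
    "solve_rate": results.groupby("mate_depth")["solved"].mean(),
    "mean_win_prob_solved": solved.groupby("mate_depth")["win_prob"].mean(),
    "mean_win_prob_unsolved": unsolved.groupby("mate_depth")["win_prob"].mean(),
})
print(summary)
```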

Winning probability distributions

Overall the winning probabilities show no Aha! effect, where the network becomes significantly more confident in its winning prospects once it sees that a move is winning. This would certainly be the case for a human or a search-based engine. The Leela network does not show a big difference between the winning probability distributions of solved vs unsolved puzzles.

Winning probability by material balance

The observed difference between these probability distributions might also be due to differences between the solved and unsolved puzzles, and is unlikely to be caused by “finding” the solution.

This becomes clearer if we look at one superficial but powerful predictor of game outcomes: material balance. In most positions the player with more material has the better prospects, and humans would also assess the material balance first when encountering a new position.

However, one of the strengths of calculating ahead lies in the ability to ignore or transcend the material balance when concrete lines show a way to a favourable outcome. A human or search-based engine might initially think that black is far ahead only to flip to “white is winning” when finding a mating combination.

Here is the average winning probability by material balance (in pawn units) for correctly solved puzzles with just one winning solution.

Despite the fact that all these puzzles are winning for white and the solution has been predicted by Leela, the average winning probability drops to zero when the material balance becomes too unfavourable.
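For reference, the material balance in pawn units is computed straight from the position; a minimal sketch assuming the usual 1/3/3/5/9 piece values:

```python
import chess

# Standard piece values in pawn units (the exact convention, e.g. knight = bishop = 3,
# is an assumption for illustration).
PIECE_VALUES = {
    chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0,
}

def material_balance(board: chess.Board) -> int:
    """Material balance from White's point of view, in pawn units."""
    balance = 0
    for piece_type, value in PIECE_VALUES.items():
        balance += value * len(board.pieces(piece_type, chess.WHITE))
        balance -= value * len(board.pieces(piece_type, chess.BLACK))
    return balance

# Toy position: White is up a queen for a rook, i.e. balance = +4.
print(material_balance(chess.Board("3r2k1/5ppp/8/8/8/8/5PPP/3Q2K1 w - - 0 1")))
```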

Conclusion

Does this analysis show without a doubt that the Leela network does not do some kind of general search or look-ahead during its forward pass?

No, unfortunately my results are also consistent with Leela implementing a mixture of pattern recognition and a look-ahead algorithm, with pattern recognition doing most of the heavy lifting and the general look-ahead just occasionally also contributing to solving a puzzle (these are not easy after all).

A clear proof of absence of system 2 thinking would require control over the training data for different training runs or significantly more powerful interpretability methods.

But I think it can be ruled out that a substantial part of the Leela network’s prowess in solving chess puzzles or predicting game outcomes is due to deliberate calculation.

There are more analyses that could be done; however, I don’t have the time. So far the analysis results have not shifted my priors much.

However, the transfer of these results to LLMs is not clear-cut because LLMs are not similarly limited to a single forward pass in their problem solving.