Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Paper authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell

TL;DR: We released a paper with IMO clear evidence of learned look-ahead in a chess-playing network (i.e., the network considers future moves to decide on its current one). This post shows some of our results, and then I describe the original motivation for the project and reflect on how it went. I think the results are interesting from a scientific and perhaps an interpretability perspective, but only mildly useful for AI safety.

Teaser for the results

(This section is copied from our project website. You may want to read it there for animations and interactive elements, then come back here for my reflections.)

Do neural networks learn to implement algorithms involving look-ahead or search in the wild? Or do they only ever learn simple heuristics? We investigate this question for Leela Chess Zero, arguably the strongest existing chess-playing network.

We find intriguing evidence of learned look-ahead in a single forward pass. This section showcases some of our results; see our paper for much more.

Setup

We consider chess puzzles such as the following:

Puzzle
In the initial board state, white sacrifices the knight on g6. Black has no choice but to capture it (second state) since the white queen prevents the king from going to g8. Then white can move the rook to h4 (third state), delivering checkmate.

We focus on the policy network of Leela, which takes in a board state and outputs a distribution over moves. With only a single forward pass per board state, it can solve puzzles like the above. (You can play against the network on Lichess to get a sense of how strong it is—its rating there is over 2600.) Humans and manually written chess engines rely on look-ahead to play chess this well; they consider future moves when making a decision. But is the same thing true for Leela?

Activations associated with future moves are crucial

One of our early experiments was to do activation patching. We patch a small part of Leela’s activations from the forward pass of a corrupted version of a puzzle into the forward pass on the original puzzle board state. Measuring the effect on the final output tells us how important that part of Leela’s activations was.

Leela is a transformer that treats every square of the chess board like a token in a language model. One type of intervention we can thus do is to patch the activations on a single square in a single layer.

Surprisingly, we found that the target square of the move two turns in the future (what we call the 3rd move target square) often stores very important information. This does not happen in every puzzle, but it does in a striking fraction, and the average effect is much bigger than that of patching on most other squares:

Activation patching results
Top row: The impact of activation patching on one square in one layer at a time in an example puzzle. Darker squares mean that patching on that square had a higher impact on the output. The 3rd move target square (blue dot) is very important in layer 10 (middle board) in some puzzles. Bottom row: Average effects over 22k puzzles. Around layer 10, the effect of patching on the 3rd move target (blue line) is big compared to most other squares (the gray line is the maximum effect over all squares other than the 1st/3rd move target and the corrupted square(s)).

The corrupted square(s) and the 1st move target square are also important (in early and late layers respectively), but we expected as much from Leela’s architecture. In contrast, the 3rd move target square stands out in middle layers, and we were much more surprised by its importance.
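For concreteness, here is a minimal sketch of what this kind of single-square, single-layer activation patching could look like in PyTorch. The module path (`model.encoder.layers`), the `(batch, 64, d_model)` activation layout, and the effect metric are illustrative assumptions rather than Leela’s actual interface; the interventions we actually ran are in our codebase.

```python
import torch

def single_square_patch_effect(model, clean_board, corrupted_board, layer_idx, square_idx):
    """Patch one square's activation in one layer from the corrupted forward pass
    into the clean forward pass, and measure how much the probability of the
    originally best move drops. Module names and shapes are assumptions."""
    layer = model.encoder.layers[layer_idx]  # assumed to output (batch, 64, d_model)

    # Clean baseline: which move does the model prefer, and with what log-prob?
    with torch.no_grad():
        clean_logits = model(clean_board)
    best_move = clean_logits.argmax(dim=-1, keepdim=True)

    # Corrupted run: cache the activation on the square we want to patch in.
    cached = {}
    def cache_hook(module, args, output):
        cached["act"] = output[:, square_idx, :].detach().clone()
    handle = layer.register_forward_hook(cache_hook)
    with torch.no_grad():
        model(corrupted_board)
    handle.remove()

    # Patched clean run: overwrite that single square's activation mid-forward-pass.
    def patch_hook(module, args, output):
        patched = output.clone()
        patched[:, square_idx, :] = cached["act"]
        return patched  # returning a value from a forward hook replaces the output
    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(clean_board)
    handle.remove()

    # Effect size: drop in log-probability of the clean best move.
    clean_lp = clean_logits.log_softmax(-1).gather(-1, best_move)
    patched_lp = patched_logits.log_softmax(-1).gather(-1, best_move)
    return (clean_lp - patched_lp).mean().item()
```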

In the paper, we take early steps toward understanding how the information stored on the 3rd move target square is being used. For example, we find a single attention head that often moves information from this future target square backward in time to the 1st move target square.

Probes can predict future moves

If Leela uses look-ahead, can we explicitly predict future moves from its activations? We train simple, bilinear probes on parts of Leela’s activations to predict the move two turns into the future (on a set of puzzles where Leela finds a single clearly best continuation). Our probe architecture is motivated by our earlier results—it predicts whether a given square is the target square of the 3rd move since, as we’ve seen, this seems to be where Leela stores important information.

We find that this probe can predict the move two turns into the future quite reliably (with 92% accuracy in layer 12).

Probing results
Results for a bilinear probe trained to predict the best move two turns into the future.
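To make “bilinear probe” a bit more concrete, here is one minimal way such a probe could be set up: each square gets a score from a low-rank bilinear form between its activation and a second activation vector, trained with a softmax over the 64 squares. The specific choice of second vector (the activation on the 1st move target square), the rank, and `d_model` below are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class BilinearSquareProbe(nn.Module):
    """Scores each of the 64 squares as the candidate 3rd-move target square.

    score(s) = (U h_s) . (V h_ref): a rank-`rank` bilinear form between the
    activation h_s on square s and a reference activation h_ref (here taken
    from the 1st move target square; an illustrative choice)."""

    def __init__(self, d_model: int, rank: int = 64):
        super().__init__()
        self.U = nn.Linear(d_model, rank, bias=False)
        self.V = nn.Linear(d_model, rank, bias=False)

    def forward(self, acts: torch.Tensor, ref_square: torch.Tensor) -> torch.Tensor:
        # acts: (batch, 64, d_model) activations from one layer of the policy net
        # ref_square: (batch,) index of the reference square for each board
        h_ref = acts[torch.arange(acts.shape[0]), ref_square]         # (batch, d_model)
        scores = torch.einsum("bsr,br->bs", self.U(acts), self.V(h_ref))
        return scores                                                  # (batch, 64) logits

# Training sketch (d_model is a placeholder): cross-entropy against the index
# of the true 3rd-move target square.
# probe = BilinearSquareProbe(d_model=768)
# loss = nn.functional.cross_entropy(probe(acts, ref_square), target_square)
```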

More results

Our paper has many more details and results than the ones we present here. For example, we find attention heads that attend to valid piece movements and seem to play an important role in look-ahead. Go take a look!

In the grand scheme of things, we still understand very little about how Leela works. Look-ahead seems to play an important role, but we don’t know much about exactly how that look-ahead is implemented. That might be an interesting direction for future research.

Piece movement patterns
Attention patterns of random examples of piece movement heads we identified in Leela. One of the roles of these heads seems to be determining the consequences of future moves.

The origins of this project

(The rest of this post consists of my personal reflections, which my co-authors might not endorse.)

My primary motivation for this project was not specifically search or look-ahead but to interpret complex algorithms in neural networks at a high level of abstraction:

  • Compared to low-level mechanistic interpretability, which often focuses on either very simple networks or very specific behaviors in complex networks, I wanted to understand relatively complex behaviors.

  • That said, I did want to understand algorithms rather than just learn that some particular feature is represented.

  • In exchange for understanding complex algorithms, I was happy for that understanding to be shoddier. The nicer way to say this is “studying the network at a high level of abstraction.”

I had been thinking conceptually a bit about what such “high-level explanations” could look like and how we could become confident in such explanations directly without going through more detailed low-level explanations. For example, causal scrubbing and similar methods define a rather rigorous standard for what a “good explanation” is. They require specifying the interpretability hypothesis as a specific computational graph, as well as identifying parts of the network with parts of the interpretability hypothesis. Can we have a similarly rigorous definition of “good high-level explanation” (even if the explanation itself is much less detailed and perhaps less rigorous)? This agenda has some spiritual similarities, though I was much less focused on objectives specifically.

I was unsure whether thinking about this would lead to anything useful or whether it would, at best, result in some nice theory without much relevance to actual interpretability research. So, I decided that it would be useful to just try making progress on a “high-level” interpretability problem with existing methods, see where I got stuck, and then develop new ideas specifically to deal with those obstacles.

Entirely separately, I heard that gpt-3.5-turbo-instruct was quite strong at chess—strong enough that it seemed plausible to me that it would need to implement some form of internal learned search. I later found out that Leela’s policy network was significantly stronger (maybe around 2400 FIDE Elo, though it’s tricky to estimate). I felt pretty convinced that any network this strong (and as good at solving puzzles as Leela is) had to do something search-like. Studying that with interpretability seemed interesting in its own right and was a nice example of answering a “high-level” question about model internals: Does Leela use search? How is that combined with the heuristics it has surely learned as well? How deep and wide is the search tree?

Theories of change

When I started this project, I had three theories of change in mind. I’ll give percentages for how much of my motivation each of these contributed (don’t take those too seriously):

  1. (~35%) Get hands-on experience trying to do “high-level” interpretability to figure out the main obstacles to that in practice (and then maybe address them in follow-up work).

  2. (~10%) Get a simple but real model organism of learned search.

  3. (~10%) Find out whether learned search happens naturally (in a case like chess, where it seems relatively favored but which also wasn’t explicitly designed to make it a certainty we’d find learned search).

A big chunk of the remaining ~45% was that it seemed like a fun and intrinsically interesting project, plus various other factors not directly about the value of the research output (like upskilling).

How it went

Relative to my original expectations, we found pretty strong evidence of look-ahead (which I’d distinguish from search, see below). However, I don’t think we made much progress on actually understanding how Leela works.

Going into the project, I thought it was quite likely that Leela was using some form of search, but I was much less sure whether we could find clear mechanistic signs of it or whether the network would just be too much of a mess. Implicitly, I assumed that our ability to find evidence of search would be closely connected to our ability to understand the network. In hindsight, that was a bad assumption. It was surprisingly easy to find decent evidence of look-ahead without understanding much about algorithms implemented by Leela (beyond the fact that it sometimes uses look-ahead).

One of my main motivations was getting a better sense of practical obstacles to understanding high-level algorithms in networks. I think that part went ok but not great. I’ve probably gained some intuitions that every experienced mech interp researcher already had. We also learned a few things that seem more specific to understanding complex behaviors, and which might be of interest to other researchers (discussed in the next section). However, I don’t feel like I learned a lot about formalizing “good high-level explanations” yet. It’s plausible that if I now went back to more conceptual research on this topic, my hands-on experience would help, but I don’t know how much.

One reason we didn’t make more progress on understanding Leela was probably that I had no interpretability experience before this project. I spent maybe ~3-4 months of full-time work on it (spread over ~7 months), and towards the end of that, I was definitely making progress more quickly than at the beginning (though part of that was being more familiar with the specific model and having better infrastructure, rather than generally getting better at mech interp). I feel optimistic that with another 3 months of work, we could understand something more meaningful about how Leela implements and uses look-ahead. But I’m unsure exactly how much progress we’d make, and I’m not sure it’s worth it.

Our paper is careful to always talk about “look-ahead,” whereas most readers likely think about “search” more often, so I want to distinguish the two. All the experiments in our paper focus on cases with a single clearly best line of play, and we show that Leela represents future moves along that line of play; that’s what I mean by “look-ahead.” We do not show that Leela compares multiple different possible lines of play, which seems like an important ingredient for “search.”

I strongly suspect that Leela does, in fact, sometimes compare multiple future lines (and we have some anecdotal evidence for this that was harder to turn into systematic experiments than our look-ahead results). But in principle, you could also imagine that Leela would consider a single promising line and, if it concludes that the line is bad, heuristically choose some “safe” alternative move. That would be an example of look-ahead that arguably isn’t “search,” which is why we use the look-ahead terminology.

Separately, any type of search Leela might implement would be chess-specific and likely involve many domain heuristics. In particular, Leela could implement search without explicitly representing the objective of winning at chess anywhere; more on this below.

Takeaways for interpretability

The first subsection below describes a technique that I think could be useful for mech interp broadly (using a weaker model to filter inputs and automatically find “interesting” corruptions for activation patching). The other takeaways are less concrete but might be interesting for people getting into the field.

Creating an input distribution using a weaker model

Very specific behaviors (such as indirect object identification (IOI) in language models) often correspond to a crisp, narrow input distribution (such as sentences with a very specific syntactic form). In contrast, we didn’t want to understand one specific behavior; we wanted to understand whether and how Leela might use search, i.e., a mechanism that could play a role in many different narrow behaviors.

We expected that search would play an especially big role in highly “tactical” positions (meaning there are concrete forcing lines of play that need to be considered to find the best move). So we started by using a dataset of tactics puzzles as our input distribution. We got a few promising results in this setting, but they were very noisy, and effect sizes were often small. I think the reason was that many of these tactics puzzles were still “not tactical enough” in the sense that they were pretty easy to solve using pattern matching.

We eventually settled on discarding any inputs where a much smaller and weaker model could also find the correct solution. This made our results instantly cleaner—things we’d previously observed on some fraction of inputs now happened more reliably. We also had to narrow the input distribution in additional chess-specific ways; for example, we wanted to show that Leela internally represents future moves, so we filtered for inputs where those moves were even predictable in principle with reasonably high confidence.
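As a rough sketch, the core filtering step amounts to something like the snippet below. The `move_probabilities` interface, the puzzle fields, the model names, and the thresholds are placeholders for illustration, not our actual pipeline.

```python
def keep_puzzle(puzzle, strong_model, weak_model,
                strong_threshold=0.5, weak_threshold=0.05):
    """Illustrative filter: keep puzzles that the strong model solves confidently
    but the weak model does not, so the remaining inputs isolate whatever the
    strong model does beyond cheap pattern matching."""
    strong_probs = strong_model.move_probabilities(puzzle.board)  # dict: move -> prob
    weak_probs = weak_model.move_probabilities(puzzle.board)
    strong_solves = strong_probs.get(puzzle.best_move, 0.0) > strong_threshold
    weak_fails = weak_probs.get(puzzle.best_move, 0.0) < weak_threshold
    return strong_solves and weak_fails

# `puzzles`, `leela_policy`, and `weak_policy` are hypothetical names.
hard_puzzles = [p for p in puzzles if keep_puzzle(p, leela_policy, weak_policy)]
```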

I think the technique of using a smaller model to filter inputs is interesting beyond just chess. Essentially, understanding the model on this distribution corresponds to understanding the "behavior" of outperforming the smaller model. This seems like a good way of focusing attention on the most "interesting" parts of the model, ignoring simple cognition/behaviors that are also present in smaller models.

We applied the same idea to finding “interesting corruptions” for activation patching automatically. If we just patched using a random sample from our dataset, many parts of the model seemed important, so this didn’t help localize interesting components much. We observed that manually making a small change to a position that influenced the best move in a "non-obvious" way gave us much more useful activation patching results. The weaker model let us automate that procedure by searching for small modifications to an input that had a strong effect on the strong model’s output but only a small effect on the weak model’s output. This let us localize model components that are important for explaining why the strong model outperforms the weak model.
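A rough sketch of that automated search, again with hypothetical helpers: `candidate_corruptions` would enumerate small board modifications (e.g. moving or removing one piece), and `move_probabilities` returns a move-to-probability dict.

```python
def total_variation(p, q):
    """Total variation distance between two move distributions given as dicts."""
    moves = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in moves)

def find_interesting_corruption(board, best_move, strong_model, weak_model,
                                candidate_corruptions):
    """Pick the small modification of `board` that changes the strong model's
    output a lot while barely changing the weak model's output. Interfaces
    and the scoring rule are illustrative assumptions."""
    strong_clean = strong_model.move_probabilities(board)
    weak_clean = weak_model.move_probabilities(board)

    def score(corrupted):
        # How much does the corruption hurt the strong model's best move...
        strong_drop = (strong_clean[best_move]
                       - strong_model.move_probabilities(corrupted).get(best_move, 0.0))
        # ...while leaving the weak model's behavior roughly unchanged?
        weak_shift = total_variation(weak_clean, weak_model.move_probabilities(corrupted))
        return strong_drop - weak_shift

    return max(candidate_corruptions, key=score)
```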

We relied on established mech interp tools more than expected

I originally thought we’d have to come up with new techniques to make much progress on finding evidence of look-ahead. Instead, our results use well-established techniques like activation patching and probing. (The main exceptions might be how we created our input distribution, as just described, and that our probes have a somewhat uncommon architecture.) It’s worth noting that, IMO, we didn’t make much progress on actual understanding, so it’s still possible that deeper understanding would require totally new techniques. But overall, existing techniques are (in hindsight unsurprisingly) very general, and most of the insights were about applying them in very specific ways.

Probing for complex things is difficult

I think this is pretty well-known (see e.g. Neel Nanda’s OthelloGPT work), but it was a bigger obstacle than I originally expected. The first idea we had for this project was to probe for representations of future board states. But if you’re training a linear probe, then it really matters how you represent this future board state in your ground truth label; intuitively similar representations might not be linear transforms of each other. Also, what if there are multiple plausible future board states? Would the model have a linear representation of “the most likely future board state?” Or would the probability of any plausible future board state be linearly extractable? Or would there be representations of future board states conditioned on specific moves?

There are many more angles of attack than time to pursue them all

This is true in research in general, but I found it true in this project to a much larger extent than in previous non-interpretability projects. I’m not sure how much of this is specific to Leela and how much is about interpretability in general. We had a lot of random observations about the model that we never got around to investigating in detail. For example, there is one attention head that seems to attend to likely moves by the opponent, but it didn’t even make it into the paper. Often, the obstacle was turning anecdotal observations into more systematic results. In particular, studying some types of mechanisms required inputs or corruptions with very specific properties—we could manually create a few of these inputs, but automating the process or manually generating a large dataset would have taken much longer. There were also many methods we didn’t get around to, such as training SAEs.

One takeaway from this is that being able to iterate quickly is important. But it also seems possible (and even more important) to improve a lot at prioritizing between different things. At the end of this project, the experiments I decided to run had interesting results significantly more often than early on. I think a lot of this was familiarity with the model and data, so there might be big advantages to working on a single model for a long time. But of course, the big disadvantage is that you might just overfit to that model.

Good infrastructure is extremely helpful

Others have said this before, but it’s worth repeating. Unlike when working with language models, we initially had no good instrumentation for Leela. We spent significant time building that ourselves, and then later on, we made Leela compatible with nnsight and built additional helper functions on top of that. All of this was very helpful for quickly trying out ideas. Part of good infrastructure is good visualization (e.g., we had helper functions for plotting attention patterns or attributions on top of chessboards in various ways). See our code if you’re interested in using any of this infrastructure for follow-up projects, and also feel free to reach out to me.

Relevance to AI safety

Earlier, I mentioned three theories of change I had for this project:

  1. Make progress on understanding complex algorithms at a high level of abstraction.

  2. Get a simple but real model organism of learned search.

  3. Find out whether learned search happens naturally.

I’m still decently excited about interpreting high-level algorithms (1.), both about research that directly tries to do that and about research that tries to find better frameworks and methods for it. Ideally, these should go hand in hand—in particular, I think it’s very easy to go off in useless directions when doing purely conceptual work.

However, I do think there are challenges to the theory of change for this “high-level algorithms” interpretability:

  • If a vague high-level understanding was all we ever got, I’m skeptical that would be directly useful for safety (at least, I’m not aware of any specific, compelling use case).

  • We might hope to understand specific safety-relevant parts of the network in more detail and use a vague high-level understanding to find those parts or integrate our understanding of them into an overall picture. I think for many versions of this, it might be much easier to find relevant parts with probing or other localization methods, and a high-level understanding of how those parts are used might not be very important.

  • If the goal is to fully understand neural networks, then I’m actually pretty excited about using this as a “top-down” approach that might meet in the middle with a “bottom-up” approach that tries to understand simpler behaviors rigorously. However, that goal seems very far away for now.

I’d still be tentatively excited for more safety-motivated interpretability researchers to directly try to make progress on gaining some high-level understanding of complex network behaviors. However, other parts of interpretability might be even more important on the margin, and interpretability as a whole is arguably already overrepresented among people motivated by existential safety.

My other motivations were directly related to learned search: having a “model organism” to study and just figuring out whether it even occurs naturally. I was less excited about these from the start, mainly because I did not expect to find search with an explicit compact representation of the objective. Typical safety reasons to be interested in learned search or learned optimization apply to such compact representations of an objective or, in other words, retargetable, general-purpose search. For example, the definition of “optimizer” from Risks from Learned Optimization mentions this explicit representation, and of course, retargeting the search requires a “goal slot” as well. While we didn’t explicitly look for retargetable search in Leela, it seems quite unlikely to me that it exists there.

Overall, I think the project went pretty well from a scientific perspective but doesn’t look great in terms of AI safety impact. I think this is due to a mix of:

  • When starting the project, I didn’t think about the theory of change in that much detail, and after some more thought over the last months, it now looks somewhat worse to me than when I started.

  • I didn’t select the project purely based on its direct AI safety impact (e.g., I also thought it would be fun and productive to work on and that it would be good for upskilling, and I think these all worked out well).

I currently don’t have concrete plans to do follow-up work myself. That said, I think trying to find out more about Leela (or similar work) could make sense for some people/under some worldviews. As I mentioned, I think there’s a lot of relatively low-hanging fruit that we just didn’t get around to. If you want to work on that and would like to chat, feel free to reach out!