Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Paper authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell
TL;DR: We released a paper with IMO clear evidence of learned look-ahead in a chess-playing network (i.e., the network considers future moves to decide on its current one). This post shows some of our results, and then I describe the original motivation for the project and reflect on how it went. I think the results are interesting from a scientific and perhaps an interpretability perspective, but only mildly useful for AI safety.
Teaser for the results
(This section is copied from our project website. You may want to read it there for animations and interactive elements, then come back here for my reflections.)
Do neural networks learn to implement algorithms involving look-ahead or search in the wild? Or do they only ever learn simple heuristics? We investigate this question for Leela Chess Zero, arguably the strongest existing chess-playing network.
We find intriguing evidence of learned look-ahead in a single forward pass. This section showcases some of our results; see our paper for much more.
Setup
We consider chess puzzles such as the following:
We focus on the policy network of Leela, which takes in a board state and outputs a distribution over moves. With only a single forward pass per board state, it can solve puzzles like the above. (You can play against the network on Lichess to get a sense of how strong it is—its rating there is over 2600.) Humans and manually written chess engines rely on look-ahead to play chess this well; they consider future moves when making a decision. But is the same thing true for Leela?
Activations associated with future moves are crucial
One of our early experiments was to do activation patching. We patch a small part of Leela’s activations from the forward pass of a corrupted version of a puzzle into the forward pass on the original puzzle board state. Measuring the effect on the final output tells us how important that part of Leela’s activations was.
Leela is a transformer that treats every square of the chess board like a token in a language model. One type of intervention we can thus do is to patch the activation on a single square in a single layer:
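To make this concrete, here is a minimal sketch of what single-square activation patching can look like in PyTorch. The `model` wrapper, its `blocks` attribute, and the assumption that each block returns a single (batch, 64, d_model) tensor are hypothetical stand-ins for our actual Leela instrumentation, not the real interface.

```python
import torch

def patch_single_square(model, clean_board, corrupted_board, layer, square):
    """Patch one square's activation from the corrupted forward pass into
    the clean forward pass and return the resulting policy.

    Assumes `model.blocks[layer]` outputs a (batch, 64, d_model) tensor;
    the real Leela wrapper differs in its details.
    """
    cache = {}

    # 1) Run the corrupted board and cache the activation at (layer, square).
    def save_hook(module, inputs, output):
        cache["act"] = output[:, square, :].detach().clone()

    handle = model.blocks[layer].register_forward_hook(save_hook)
    with torch.no_grad():
        model(corrupted_board)
    handle.remove()

    # 2) Run the clean board, overwriting that one square's activation.
    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, square, :] = cache["act"]
        return patched

    handle = model.blocks[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_policy = model(clean_board)
    handle.remove()

    return patched_policy  # compare to model(clean_board) to measure the effect
```

Measuring how much `patched_policy` differs from the clean policy (for example, the drop in probability of the correct move) quantifies how important that square's activation in that layer was.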
Surprisingly, we found that the target square of the move two turns in the future (what we call the 3rd move target square) often stores very important information. This does not happen in every puzzle, but it does in a striking fraction, and the average effect is much bigger than that of patching on most other squares:
The corrupted square(s) and the 1st move target square are also important (in early and late layers respectively), but we expected as much from Leela’s architecture. In contrast, the 3rd move target square stands out in middle layers, and we were much more surprised by its importance.
In the paper, we take early steps toward understanding how the information stored on the 3rd move target square is being used. For example, we find a single attention head that often moves information from this future target square backward in time to the 1st move target square.
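For intuition, the kind of quantity involved can be computed from cached attention patterns. Below is a hedged sketch (not our actual analysis code), where `attn_patterns` is assumed to be a precomputed attention tensor for one layer.

```python
import torch

def backward_info_scores(attn_patterns, first_targets, third_targets):
    """Average attention each head puts from the 1st-move target square
    (query) onto the 3rd-move target square (key), across puzzles.

    attn_patterns: (n_puzzles, n_heads, 64, 64) tensor for one layer.
    first_targets, third_targets: (n_puzzles,) long tensors of square indices.
    A head that moves look-ahead information "backward in time" should
    score high here.
    """
    idx = torch.arange(attn_patterns.shape[0])
    scores = attn_patterns[idx, :, first_targets, third_targets]  # (n_puzzles, n_heads)
    return scores.mean(dim=0)  # one score per head
```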
Probes can predict future moves
If Leela uses look-ahead, can we explicitly predict future moves from its activations? We train simple, bilinear probes on parts of Leela’s activations to predict the move two turns into the future (on a set of puzzles where Leela finds a single clearly best continuation). Our probe architecture is motivated by our earlier results—it predicts whether a given square is the target square of the 3rd move since, as we’ve seen, this seems to be where Leela stores important information.
We find that this probe can predict the move 2 turns in the future quite reliably (with 92% accuracy in layer 12).
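For concreteness, here is one way such a bilinear probe could be parameterized: a low-rank bilinear form between each candidate square's activation and a reference square's activation (taken here, as an illustrative assumption, to be the 1st-move target square). The class and its exact inputs are a sketch; the probe in the paper may pair activations differently.

```python
import torch
import torch.nn as nn

class BilinearSquareProbe(nn.Module):
    """Scores every square as a candidate 3rd-move target square via a
    low-rank bilinear form between that square's activation and a
    reference square's activation. Illustrative only; the paper's probe
    may use different inputs."""

    def __init__(self, d_model: int, rank: int = 32):
        super().__init__()
        self.left = nn.Linear(d_model, rank, bias=False)
        self.right = nn.Linear(d_model, rank, bias=False)

    def forward(self, acts: torch.Tensor, ref_square: torch.Tensor) -> torch.Tensor:
        # acts: (batch, 64, d_model); ref_square: (batch,) long square indices
        ref = acts[torch.arange(acts.shape[0]), ref_square]  # (batch, d_model)
        left = self.left(acts)                               # (batch, 64, rank)
        right = self.right(ref).unsqueeze(1)                 # (batch, 1, rank)
        return (left * right).sum(dim=-1)                    # (batch, 64) logits
```

Trained with softmax cross-entropy over the 64 squares against the true 3rd-move target, a probe of this shape stays bilinear in the activations while using far fewer parameters than a full d_model x d_model matrix.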
More results
Our paper has many more details and results than the ones we present here. For example, we find attention heads that attend to valid piece movements and seem to play an important role for look-ahead. Go take a look!
In the grand scheme of things, we still understand very little about how Leela works. Look-ahead seems to play an important role, but we don’t know much about exactly how that look-ahead is implemented. That might be an interesting direction for future research.
The origins of this project
(The rest of this post consists of my personal reflections, which my co-authors might not endorse.)
My primary motivation for this project was not specifically search or look-ahead but to interpret complex algorithms in neural networks at a high level of abstraction:
Compared to low-level mechanistic interpretability, which often focuses on either very simple networks or very specific behaviors in complex networks, I wanted to understand relatively complex behaviors.
That said, I did want to understand algorithms rather than just learn that some particular feature is represented.
In exchange for understanding complex algorithms, I was happy for that understanding to be shoddier. The nicer way to say this is “studying the network at a high level of abstraction.”
I had been thinking conceptually a bit about what such “high-level explanations” could look like and how we could become confident in such explanations directly without going through more detailed low-level explanations. For example, causal scrubbing and similar methods define a rather rigorous standard for what a “good explanation” is. They require specifying the interpretability hypothesis as a specific computational graph, as well as identifying parts of the network with parts of the interpretability hypothesis. Can we have a similarly rigorous definition of “good high-level explanation” (even if the explanation itself is much less detailed and perhaps less rigorous)? This agenda has some spiritual similarities, though I was much less focused on objectives specifically.
I was unsure whether thinking about this would lead to anything useful or whether it would, at best, result in some nice theory without much relevance to actual interpretability research. So, I decided that it would be useful to just try making progress on a “high-level” interpretability problem with existing methods, see where I got stuck, and then develop new ideas specifically to deal with those obstacles.
Entirely separately, I heard that gpt-3.5-turbo-instruct was quite strong at chess—strong enough that it seemed plausible to me that it would need to implement some form of internal learned search. I later found out that Leela’s policy network was significantly stronger (maybe around 2400 FIDE Elo, though it’s tricky to estimate). I felt pretty convinced that any network this strong (and as good at solving puzzles as Leela is) had to do something search-like. Studying that with interpretability seemed interesting in its own right and was a nice example of answering a “high-level” question about model internals: Does Leela use search? How is that combined with the heuristics it has surely learned as well? How deep and wide is the search tree?
Theories of change
When I started this project, I had three theories of change in mind. I’ll give percentages for how much each of these contributed to my motivation (don’t take those too seriously):
(~35%) Get hands-on experience trying to do “high-level” interpretability to figure out the main obstacles to that in practice (and then maybe address them in follow-up work).
(~10%) Get a simple but real model organism of learned search.
(~10%) Find out whether learned search happens naturally (in a case like chess, where it seems relatively favored but which also wasn’t explicitly designed to make it a certainty we’d find learned search).
A big chunk of the remaining ~45% was that it seemed like a fun and intrinsically interesting project, plus various other factors not directly about the value of the research output (like upskilling).
How it went
Relative to my original expectations, we found pretty strong evidence of look-ahead (which I’d distinguish from search, see below). However, I don’t think we made much progress on actually understanding how Leela works.
Going into the project, I thought it was quite likely that Leela was using some form of search, but I was much less sure whether we could find clear mechanistic signs of it or whether the network would just be too much of a mess. Implicitly, I assumed that our ability to find evidence of search would be closely connected to our ability to understand the network. In hindsight, that was a bad assumption. It was surprisingly easy to find decent evidence of look-ahead without understanding much about algorithms implemented by Leela (beyond the fact that it sometimes uses look-ahead).
One of my main motivations was getting a better sense of practical obstacles to understanding high-level algorithms in networks. I think that part went ok but not great. I’ve probably gained some intuitions that every experienced mech interp researcher already had. We also learned a few things that seem more specific to understanding complex behaviors, and which might be of interest to other researchers (discussed in the next section). However, I don’t feel like I learned a lot about formalizing “good high-level explanations” yet. It’s plausible that if I now went back to more conceptual research on this topic, my hands-on experience would help, but I don’t know how much.
One reason we didn’t make more progress on understanding Leela was probably that I had no interpretability experience before this project. I spent maybe ~3-4 months of full-time work on it (spread over ~7 months), and towards the end of that, I was definitely making progress more quickly than at the beginning (though part of that was being more familiar with the specific model and having better infrastructure, rather than generally getting better at mech interp). I feel optimistic that with another 3 months of work, we could understand something more meaningful about how Leela implements and uses look-ahead. But I’m unsure exactly how much progress we’d make, and I’m not sure it’s worth it.
Sidenote: look-ahead vs search
Our paper is careful to always talk about “look-ahead,” whereas most readers likely think about “search” more often, so I want to distinguish the two. All the experiments in our paper focus on cases with a single clearly best line of play, and we show that Leela represents future moves along that line of play; that’s what I mean by “look-ahead.” We do not show that Leela compares multiple different possible lines of play, which seems like an important ingredient for “search.”
I strongly suspect that Leela does, in fact, sometimes compare multiple future lines (and we have some anecdotal evidence for this that was harder to turn into systematic experiments than our look-ahead results). But in principle, you could also imagine that Leela would consider a single promising line and, if it concludes that the line is bad, heuristically choose some “safe” alternative move. That would be an example of look-ahead that arguably isn’t “search,” which is why we use the look-ahead terminology.
Separately, any type of search Leela might implement would be chess-specific and likely involve many domain heuristics. In particular, Leela could implement search without explicitly representing the objective of winning at chess anywhere; more on this below.
Takeaways for interpretability
The first subsection below describes a technique that I think could be useful for mech interp broadly (using a weaker model to filter inputs and automatically find “interesting” corruptions for activation patching). The other takeaways are less concrete but might be interesting for people getting into the field.
Creating an input distribution using a weaker model
Very specific behaviors (such as IOI) often correspond to a crisp, narrow input distribution (such as sentences with a very specific syntactic form). In contrast, we didn’t want to understand one specific behavior; we wanted to understand whether and how Leela might use search, i.e., a mechanism that could play a role in many different narrow behaviors.
We expected that search would play an especially big role in highly “tactical” positions (meaning there are concrete forcing lines of play that need to be considered to find the best move). So we started by using a dataset of tactics puzzles as our input distribution. We got a few promising results in this setting, but they were very noisy, and effect sizes were often small. I think the reason was that many of these tactics puzzles were still “not tactical enough” in the sense that they were pretty easy to solve using pattern matching.
We eventually settled on discarding any inputs where a much smaller and weaker model could also find the correct solution. This made our results instantly cleaner—things we’d previously observed on some fraction of inputs now happened more reliably. We also had to narrow the input distribution in additional chess-specific ways; for example, we wanted to show that Leela internally represents future moves, so we filtered for inputs where those moves were even predictable in principle with reasonably high confidence.
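A minimal sketch of the filtering step, assuming hypothetical `strong_best_move` and `weak_best_move` functions and a puzzle dataset with `board` and `solution` fields (not our actual data format):

```python
def filter_by_weak_model(puzzles, strong_best_move, weak_best_move):
    """Keep only puzzles that the strong model solves but the weak model fails.

    `strong_best_move` / `weak_best_move` map a board to that network's
    top move; `puzzle.board` / `puzzle.solution` are hypothetical fields.
    """
    kept = []
    for puzzle in puzzles:
        strong_correct = strong_best_move(puzzle.board) == puzzle.solution
        weak_correct = weak_best_move(puzzle.board) == puzzle.solution
        if strong_correct and not weak_correct:
            kept.append(puzzle)
    return kept
```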
I think the technique of using a smaller model to filter inputs is interesting beyond just chess. Essentially, understanding the model on this distribution corresponds to understanding the “behavior” of outperforming the smaller model. This seems like a good way of focusing attention on the most “interesting” parts of the model, ignoring simple cognition/behaviors that are also present in smaller models.
We applied the same idea to finding “interesting corruptions” for activation patching automatically. If we just patched using a random sample from our dataset, many parts of the model seemed important, so this didn’t help localize interesting components much. We observed that manually making a small change to a position that influenced the best move in a “non-obvious” way gave us much more useful activation patching results. The weaker model let us automate that procedure by searching for small modifications to an input that had a strong effect on the big model’s output but only a small effect on the weak model’s output. This lets us localize model components that are important for explaining why the strong model outperforms the weak model.
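Here is a hedged sketch of how that automated search could look. The scoring rule (a difference of KL divergences) and the `small_modifications` generator are illustrative assumptions rather than necessarily what we used.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for move distributions given as {move: probability} dicts."""
    return sum(pm * math.log((pm + eps) / (q.get(m, 0.0) + eps)) for m, pm in p.items())

def find_interesting_corruption(board, strong_policy, weak_policy, small_modifications):
    """Return the small modification that shifts the strong model's policy a
    lot while shifting the weak model's policy as little as possible.

    `strong_policy` / `weak_policy` map a board to a move distribution, and
    `small_modifications` yields candidate corrupted boards (e.g. one piece
    added, removed, or moved); all three are hypothetical stand-ins.
    """
    p_strong, p_weak = strong_policy(board), weak_policy(board)
    best_board, best_score = None, float("-inf")
    for corrupted in small_modifications(board):
        strong_shift = kl_divergence(p_strong, strong_policy(corrupted))
        weak_shift = kl_divergence(p_weak, weak_policy(corrupted))
        score = strong_shift - weak_shift  # large effect on strong, small on weak
        if score > best_score:
            best_board, best_score = corrupted, score
    return best_board
```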
We relied on established mech interp tools more than expected
I originally thought we’d have to come up with new techniques to make much progress on finding evidence of look-ahead. Instead, our results use well-established techniques like activation patching and probing. (The main exceptions might be how we created our input distribution, as just described, and that our probes have a somewhat uncommon architecture.) It’s worth noting that we didn’t make too much progress on actual understanding IMO, so it’s still possible that this would require totally new techniques. But overall, existing techniques are (in hindsight unsurprisingly) very general, and most of the insights were about applying them in very specific ways.
Probing for complex things is difficult
I think this is pretty well-known (see e.g. Neel Nanda’s OthelloGPT work), but it was a bigger obstacle than I originally expected. The first idea we had for this project was to probe for representations of future board states. But if you’re training a linear probe, then it really matters how you represent this future board state in your ground truth label; intuitively similar representations might not be linear transforms of each other. Also, what if there are multiple plausible future board states? Would the model have a linear representation of “the most likely future board state?” Or would the probability of any plausible future board state be linearly extractable? Or would there be representations of future board states conditioned on specific moves?
There are many more angles of attack than time to pursue them all
This is true in research in general, but I found it true in this project to a much larger extent than in previous non-interpretability projects. I’m not sure how much of this is specific to Leela and how much is about interpretability in general. We had a lot of random observations about the model that we never got around to investigating in detail. For example, there is one attention head that seems to attend to likely moves by the opponent, but it didn’t even make it into the paper. Often, the obstacle was turning anecdotal observations into more systematic results. In particular, studying some types of mechanisms required inputs or corruptions with very specific properties—we could manually create a few of these inputs, but automating the process or manually generating a large dataset would have taken much longer. There were also many methods we didn’t get around to, such as training SAEs.
One takeaway from this is that being able to iterate quickly is important. But it also seems possible (and even more important) to improve a lot at prioritizing between different things. At the end of this project, the experiments I decided to run had interesting results significantly more often than early on. I think a lot of this was familiarity with the model and data, so there might be big advantages to working on a single model for a long time. But of course, the big disadvantage is that you might just overfit to that model.
Good infrastructure is extremely helpful
Others have said this before, but it’s worth repeating. Unlike when working with language models, we initially had no good instrumentation for Leela. We spent significant time building that ourselves, and then later on, we made Leela compatible with nnsight and built additional helper functions on top of that. All of this was very helpful for quickly trying out ideas. Part of good infrastructure is good visualization (e.g., we had helper functions for plotting attention patterns or attributions on top of chessboards in various ways). See our code if you’re interested in using any of this infrastructure for follow-up projects, and also feel free to reach out to me.
Relevance to AI safety
Earlier, I mentioned three theories of change I had for this project:
1. Make progress on understanding complex algorithms at a high level of abstraction.
2. Get a simple but real model organism of learned search.
3. Find out whether learned search happens naturally.
I’m still decently excited about interpreting high-level algorithms (1.), both about research that directly tries to do that and research that tries to find better frameworks and methods for it. Ideally, these should go hand in hand—in particular, I think it’s very easy to go off in useless directions when doing purely conceptual work.
However, I do think there are challenges to the theory of change for this “high-level algorithms” interpretability:
If a vague high-level understanding was all we ever got, I’m skeptical that would be directly useful for safety (at least, I’m not aware of any specific, compelling use case).
We might hope to understand specific safety-relevant parts of the network in more detail and use a vague high-level understanding to find those parts or integrate our understanding of them into an overall picture. I think for many versions of this, it might be much easier to find relevant parts with probing or other localization methods, and a high-level understanding of how those parts are used might not be very important.
If the goal is to fully understand neural networks, then I’m actually pretty excited about using this as a “top-down” approach that might meet in the middle with a “bottom-up” approach that tries to understand simpler behaviors rigorously. However, that goal seems very far away for now.
I’d still be tentatively excited for more safety-motivated interpretability researchers to directly try to make progress on gaining some high-level understanding of complex network behaviors. However, other parts of interpretability might be even more important on the margin, and interpretability as a whole is arguably already overrepresented among people motivated by existential safety.
My other motivations were directly related to learned search: having a “model organism” to study and just figuring out whether it even occurs naturally. I was less excited about these from the start, mainly because I did not expect to find search with an explicit compact representation of the objective. Typical safety reasons to be interested in learned search or learned optimization apply to such compact representations of an objective or, in other words, retargetable, general-purpose search. For example, the definition of “optimizer” from Risks from Learned Optimization mentions this explicit representation, and of course, retargeting the search requires a “goal slot” as well. While we didn’t explicitly look for retargetable search in Leela, it seems quite unlikely to me that it exists there.
Overall, I think the project went pretty well from a scientific perspective but doesn’t look great in terms of AI safety impact. I think this is due to a mix of:
When starting the project, I didn’t think about the theory of change in that much detail, and after some more thought over the last months, it now looks somewhat worse to me than when I started.
I didn’t select the project purely based on its direct AI safety impact (e.g., I also thought it would be fun and productive to work on and that it would be good for upskilling, and I think these all worked out well).
I currently don’t have concrete plans to do follow-up work myself. That said, I think trying to find out more about Leela (or similar work) could make sense for some people/under some worldviews. As I mentioned, I think there’s a lot of relatively low-hanging fruit that we just didn’t get around to. If you want to work on that and would like to chat, feel free to reach out!
Very cool project!
My impression so far was that transformer models do not learn search in chess, and you are careful to speak only about lookahead. I would suggest that even that is not necessarily the case: I suspect the models learn to recognise multi-move patterns, i.e., they recognise positions that allow certain multi-move tactical strikes.
To tease search/lookahead and pattern recognition apart, I started creating a benchmark with positions that are solved by surprising and unintuitive moves, but I really didn’t have any time to keep working on this idea and it has been on hold for a couple of months.
I thought I’d spell this out a bit:
What is the difference between multi-move pattern recognition and lookahead/search?
Lookahead/search is a general algorithm, built on top of relatively easy-to-learn move prediction and position evaluation. In the case of lookahead, this general algorithm takes the move prediction and goes through the positions that arise when making the predicted moves while assessing these positions.
Multi-move pattern recognition starts out as simple pattern recognition: The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.
Sometimes it will predict this move although the combination, the multi-move tactical strike, doesn’t quite work, but over time it will learn that Ng6 is unlikely when there is a black bishop on d2 ready to drop back to h6 or when there is a black knight on g2, guarding the square h4 where the rook would have to checkmate.
This is what I call multi-move pattern recognition: The network has to learn a lot of details and conditions to predict when the tactical strike (a multi-move pattern) works and when it doesn’t. In this case you would make the same observations that were described in this post: For example, if you ablated the square h4, you’d lose the information of whether it will be available for the rook on the second move. It is important for the pattern recognition to know where future pieces have to go.
But the crucial difference to lookahead or search is that this is not a general mechanism. Quite the contrary, it is the result of increasing specialisation on this particular type of position. If you removed a certain tactical pattern from the training data, the NN would be unable to find it.
It is exactly the ability of system 2 thinking to generalise much further than this that makes the question of whether transformers develop it so important.
I think the methods described in this post are even in principle unable to distinguish between multi-move pattern recognition and lookahead/search. Certain information is necessary to make the correct prediction in certain kinds of positions. The fact that the network generally makes the correct prediction in these types of positions already tells you that this information must be processed and made available by the network. The difference between lookahead and multi-move pattern recognition is not whether this information is there but how it got there.
Thanks for the elaboration, these are good points. I think the difference between what you call look-ahead vs. pattern recognition lies on a more continuous spectrum. For example, you say the network might learn “that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.”
You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis and have a king on a8, the b8 square covered, etc. Even more generally, you could learn that with a king on h8, etc., the h7 pawn is “effectively pinned,” and so g6 isn’t actually protected—this might then generalize to capturing a piece on g6 with some piece other than a knight (thus not giving check). Continuing like this, I think you could basically fill the entire spectrum between very simple pattern recognition and very general algorithms.
From that perspective, I’d guess Leela sits somewhere in the middle of that spectrum. I agree it’s likely not implementing “a general algorithm, built on top of relatively easy-to-learn move prediction and position evaluation” in the broadest sense. On the other hand, I think some of our evidence points towards mechanisms that are used for “considering future moves” and that are shared between a broad range of board states (mainly the attention head results, more arguably the probe).
I think the spectrum you describe is between pattern recognition by literal memorisation and pattern recognition building on general circuits.
There are certainly general circuits that compute whether a certain square can be reached by a certain piece on a certain other square.
But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the “pattern recognition”-network to predict that Ng6 is not a feasible option.
The “lookahead” network, however, would go through these moves and assess that 2.Rh4 is not mate because of 2...Bh6. The lookahead algorithm would allow it to use general low-level circuits like “block mate” or “move bishop/queen on a diagonal” to generalise to unseen combinations of patterns.
I still don’t see the crisp boundary you seem to be getting at between “pattern recognition building on general circuits” and what you call “look-ahead.” It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:
If the training data had an example of a rook checkmate on h4 being blocked by a bishop to h6, you could imagine many different possibilities:
This doesn’t generalize to a rook checkmate on h3 being blocked by a bishop (i.e. the network would get that change wrong if it hasn’t also explicitly seen it)
This generalizes to rook checkmates along the h-file, but doesn’t generalize to rook checkmates along other files
This generalizes to arbitrary rook checkmates
This also generalizes to bishop checkmates being blocked
This also generalizes to a rook trapping the opponent queen (instead of the king)
...
(Of course, this generalization question is likely related to the question of whether these different cases share “mechanisms.”)
At the extreme end of this spectrum, I imagine a policy whose performance only depends on some simple measure of “difficulty” (like branching factor/depth needed) and which internally relies purely on simple algorithms like tree search without complex heuristics. To me, this seems like an idealized limit point to this spectrum (and not something we’d expect to actually see; for example, humans don’t do this either). You might have something different/broader in mind for “look-ahead,” but when I think about broader versions of this, they just bleed into what seems like a continuous spectrum.
I don’t think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.
Here is one (thought) experiment to tease this apart: Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motif and erase it from the checkmate-prediction part of the training set, but not the move-prediction part.
Now the model still knows which moves are the right ones to make, i.e., it would play the checkmate variation in a game. But would it still be able to predict the checkmate?
If it relies on pattern recognition, it wouldn’t: it has never seen this pattern connected to mate-in-x. But if it relies on lookahead, where it leverages its ability to predict the correct moves and then assesses the final position, it would still be able to predict the mate.
The results of this experiment would also be on a spectrum from 0% to 100% correct checkmate prediction for this tactical motif. But I think it would be fair to say that it hasn’t really learned lookahead at 0% or a very low percentage, and that’s what I would expect.
Maybe I misunderstood you then, and tbc I agree that you don’t need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)
For your thought experiment, my prediction would depend on the specifics of what this “tactical motif” looks like. For a very narrow motif, I expect the checkmate predictor will just generalize correctly. For a broader motif (like all back-rank mates), I’m much less sure. It still seems plausible it would generalize if both predictors are just very simple heads on top of a shared network body. The less of the computational work is shared between the heads, the less likely generalization seems.
Note that 0% to 100% accuracy is not the main spectrum I’m thinking of (though I agree it’s also relevant). The main spectrum for me is the broadness of the motif (and in this case how much computation the heads share, but that’s more specific to this experiment).
Hmm, yeah, I think we are talking past each other.
Everything you describe is just pattern recognition to me. Lookahead or search does not depend on the broadness of the motif.
Lookahead, to me, is the ability to look ahead and see what is there. It allows very high certainty even for never before seen mating combinations.
If the line is forcing enough, it allows finding very deep combinations (which you will never find with pattern recognition, because the combinatorial explosion means that basically every deep combination has never been seen before).
In humans, it is clearly different from pattern recognition. Humans can see multi-move patterns at a glance. I would play the example in the post instantly in every blitz game; I would check the conditions of the pattern, but I wouldn’t have to “look ahead.”
Humans consider future moves even when intuitively assessing positions: “This should be winning, because I still have x, y, and z in the position.” But actually calculating is clearly different because it is effortful. You have to force yourself to do it (or at least I usually have to). You manipulate the position sequentially in your mind and see what could happen. This allows you to see many things that you couldn’t predict from your past experience in similar positions.
I didn’t want to get hung up on whether there is a crisp boundary. Maybe you are right and you just keep generalising and generalising until there is a search algo in the limit. I very much doubt this is where the ability of humans to calculate ahead comes from. In transformers? Who knows.
The shallow lookahead seems consistent with the observation in “Grandmaster-Level Chess Without Search”, Ruoss et al 2024, that the gains seem to stop after a few layers and a relatively shallow 8-layer NN saturates. I take that as suggesting that there are optimization/architecture difficulties here for learning better lookahead / planning as an unrolled feedforward NN with no weight-sharing or explicit search scaffolding like a MuZero.
It seems like a NN ought to be able to at least learn a sort of ‘beam search’ by examining multiple possible lines of play in parallel during the feedforward (because NNs tend to have way more computational power than they need and we can see in LLMs that you can easily ask them to compute multiple responses in parallel as ‘multiplexed’ computations, so if it can do one lookahead then it ought to be able to do multiple lookaheads in parallel), so that might be something to consider looking for: can you find evidence of multiple moves being considered in parallel? And if not, does changing the arch to use tied weights potentially add that internally?
(A possible corollary of this would be per Jones’s smooth RL scaling laws of train vs search: either those scaling laws break down at some point where the internal lookahead breaks down at 8-layers or so and performance then saturates as the NN can no longer improve, or those scaling laws already incorporate the benefits of internal lookahead and so could be made much better by any improvements to the internal amortized search.)
This is really exciting to me. If this work generalizes to future models capable of sophisticated planning in the real world, we will be able to forecast future actions that internally justify an AI’s current actions and thus tell whether they’re planning to coup or not, whether or not an explicit general-purpose representation of the objective exists.
Good point, explicit representations of the objective might not be as crucial for safety applications as my post frames it.
That said, some reasons this might not generalize in a way that enables this kind of application:
I think this type of look-ahead/search is especially favored in chess, and it might not be as important in at least some domains in which we’d want to understand the model’s cognition.
Our results are on a very narrow subset of board states (“tactically complex” ones). We already start with a filtered set of “puzzles” instead of general states, and then use only 2.5% of those. Anecdotally, the mechanisms we found are much less prevalent in random states.
I do think there’s an argument that these “tactically complex” states are the most interesting ones. But on the other hand, a lot of Leela’s playing strength comes from making very good decisions in “normal” states, which accumulate over the course of a game.
Chess has an extremely simple “world model” with clearly defined states and actions. And we know exactly what that world model is, so it’s easy-ish to look for relevant representations inside the network. I’d expect everything is just much messier for networks using models of the real world.
We have ground truth for the “correct” reason for any given move (using chess engines much stronger than the Leela network by itself). And in fact, we try to create an input distribution where we have reason to believe that we know what future line Leela is considering; then we train probes on this dataset (among other techniques). In a realistic scenario, we might not have any examples where we know for sure why the AI took an action.
I don’t think our understanding of Leela is good enough to enable these kinds of applications. For example, pretend we were trying to figure out whether Leela is really “trying” to win at chess, or whether it’s actually pursuing some other objective that happens to correlate pretty well with winning. (This admittedly isn’t a perfect analogy for planning a coup.) I don’t think our results so far would have told us.
I’m reasonably optimistic that we could get there in the specific case of Leela, though, with a lot of additional work.
I think this is pretty excellent. I wonder if maybe this is one of those “it works so well it’s boring” research directions and you are massively underrating it.
This is very cool!