Sydney can play chess and kind of keep track of the board state

TL;DR: Bing chat/Sydney can quite reliably suggest legal and mostly reasonable chess moves, based on just a list of previous moves (i.e. without explicitly telling it the board position). This works even deep-ish into the game (I tried up to ~30 moves). It can also specify the board position after a sequence of moves, though it makes some mistakes like missing pieces or sometimes hallucinating them.
Zack Witten’s Twitter thread
Credit for discovering this goes to Zack Witten; I first saw it in this Twitter thread. Zack gave Sydney the first 14 moves of a chess game leading to the following position (black to move):
Sydney (playing both sides) suggested the continuation 14. … f5 15. exf5 Bxf5 16. Qd1 Bxc2 17. Qxc2 d3 18. Qxd3 Qxf2+ 19. Kh1 Qxe1+ 20. Ng1 Nf2# (see the Tweet for an animated GIF of those moves). All these moves are legal and very reasonable (though White makes mistakes).
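As an aside, legality claims like this are easy to check mechanically by replaying the moves with the python-chess library, which rejects illegal moves. Here is a minimal sketch, assuming python-chess is installed; the move list below is just a placeholder (the real moves are in Zack's thread):

```python
# Minimal sketch: replay a list of SAN moves and check that each one is legal,
# using the python-chess library (pip install python-chess).
import chess

moves = ["e4", "e5", "Nf3", "Nc6"]  # placeholder; substitute the actual game

board = chess.Board()
for san in moves:
    try:
        board.push_san(san)  # raises ValueError if the move is illegal or unparseable
    except ValueError as err:
        print(f"Problem with move {san!r}: {err}")
        break
else:
    print("All moves were legal. Final position:")
    print(board)        # ASCII diagram
    print(board.fen())  # FEN string
```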
Note that the prompt for Sydney tells it to use Stockfish, and Sydney searches for online versions of Stockfish and claims that its moves are generated by Stockfish. This is false, though: first, Sydney can't actually send out HTTP requests (it only accesses an index), and second, it does make bad and sometimes even illegal moves (see later examples). So all the capabilities shown here are actually Sydney's, not Stockfish's.
The Twitter thread has more examples but I’ll skip them in favor of my own.
My own results
The position above is still reasonably early and a pretty normal chess position. I instead tried this somewhat weirder one (which arises after 25 moves, black to play):
(To be clear, Sydney got just the moves leading to this position, not the position itself explicitly; see the Appendix.)
This is from an over-the-board game I played years ago, which has never been posted online, so it wasn't in Sydney's training data (and the continuation in the game was different anyway).
Sydney’s completion was: 25… Qc7 26. g5 Nd7 27. Nf5 Re8 28. Rh2 Be6 29. Rb2 Nc5 30. Bb5 Rb8 (it also adds some incorrect evaluations in between). Position at the end of that line:
Again, all of the moves are legal and they make a lot of sense—attacking pieces and then defending them or moving them away.
Sydney making mistakes
Sydney did much worse when I asked questions like “What are the legal moves of the black knight in the position after 25. h4?” (i.e. the first of my board positions shown above). See end of the first transcript in the appendix for an example.
Instead, asking it to use Stockfish to find the two best moves for that knight worked better, but still worse than the game completions. It said:
25… Nd7 26. g5 Nc5 27. Nf5 Re8 28. Rh2 Be6 29. Rb2 Nxe4 30. fxe4 Bxf5 with an evaluation of −0.9
25… Nd5 26. exd5 Qxd5+ 27. Ke1 Qb3 28. Kf2 d5 29. Kg2 Bc5 with an evaluation of −0.9
The first continuation is reasonable initially, though 29… Nxe4 is a bizarre blunder. In the second line, it blunders the knight immediately (25… Ne8 is the actual second-best knight move). More interestingly, it then makes an illegal move (26… Qxd5+ tries to move the queen through its own pawn on d6).
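For reference, the ground truth for questions like "what are the legal moves of the black knight here?" is easy to compute once the move sequence has been replayed, and the same push_san replay from the sketch above would also flag an illegal move like 26… Qxd5+. A sketch along the same lines, again with a placeholder move list (the real one is in the Appendix):

```python
# Sketch: list the legal moves of a given piece type for the side to move,
# after replaying a (placeholder) move list with python-chess.
import chess

moves = ["e4", "e5", "Nf3", "Nc6"]  # placeholder; use the real 25-move sequence

board = chess.Board()
for san in moves:
    board.push_san(san)  # raises ValueError on any illegal move

knight_moves = [
    board.san(m)
    for m in board.legal_moves
    if board.piece_at(m.from_square).piece_type == chess.KNIGHT
]
print("Legal knight moves for the side to move:", knight_moves)
```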
Reconstructing the board position from the move sequence
Next, I asked Sydney to give me the FEN (a common encoding of chess positions) for the position after the 25-move sequence. I told it to use Stockfish for that (even though this doesn't make much sense)—just asking directly without that instruction gave significantly worse results. The FEN it gave is “r4rk1/4bppp/3p1n2/4p3/6PP/2P1PQ2/b7/3K1BNR b - - 0 25”, which is a valid FEN for the following position:
For reference, here’s the actual position again:
Sydney hallucinates an additional black rook on a8, messes up the positions of the white knight and queen a bit, and forgets about the black queen and a few pawns. On the other hand, there is a very clear resemblance between these positions.
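As a side note, a FEN like the one Sydney produced can be checked and rendered mechanically; here is a minimal sketch with python-chess (assuming the library is installed), using Sydney's FEN from above:

```python
# Sketch: parse the FEN Sydney produced, check basic validity, and render it.
import chess

fen = "r4rk1/4bppp/3p1n2/4p3/6PP/2P1PQ2/b7/3K1BNR b - - 0 25"

board = chess.Board(fen)  # raises ValueError if the FEN is malformed
print("Structurally valid position:", board.is_valid())
print(board)              # ASCII diagram of the reconstructed position
```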
I also asked it to just list all the pieces instead of creating an FEN (in a separate chat). The result was
White: King on d1, Queen on e3, Rook on h1, Bishop on f1, Knight on g3, Pawn on e4, f3 and g4.
Black: King on g8, Queen on d8, Rook on f8 and a8, Bishop on e7 and a2, Knight on f6 and Pawn on d6 and e5.
This is missing two white pawns (c3 and h4) and again hallucinates a black rook on a8 and forgets black’s pawns on f7, g7, h7. It’s interesting that it hallucinates that rook in both cases, given these were separate chats. Also worth noting that it misses the pawn on h4 here even though that should be easy to get right (since the last move was moving that pawn to h4).
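For comparison, producing a ground-truth piece list like this is mechanical once you have the actual board (e.g. by replaying the move sequence as in the earlier sketches); a small sketch, using the starting position as a placeholder:

```python
# Sketch: print a piece list (in the same style Sydney used) for an arbitrary board.
# The board here is just the starting position; replaying the real move list as in
# the earlier sketches would give the actual position after 25. h4.
import chess

board = chess.Board()  # placeholder

for color, name in [(chess.WHITE, "White"), (chess.BLACK, "Black")]:
    pieces = [
        f"{chess.piece_name(piece.piece_type).capitalize()} on {chess.square_name(sq)}"
        for sq, piece in sorted(board.piece_map().items())
        if piece.color == color
    ]
    print(f"{name}: " + ", ".join(pieces) + ".")
```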
How does it do this?
My best guess is that Sydney isn’t quite keeping track of the board state in a robust and straightforward way. It does occasionally make illegal moves, and it has trouble with things like reconstructing the board position or listing legal moves. On the other hand, it seems very clear that it’s keeping track of some approximate board position, likely some ad-hoc messy encoding of where various pieces are.
Overall, I was quite surprised by these results. I would have predicted it would do much worse, similar to ChatGPT[1]. (ETA: Bucky tested ChatGPT in more detail and it turns out it’s actually decent at completing lines with the right prompt). I don’t have a great idea for how Sydney might do this internally.
Just how good is it? Hard to say based on these few examples, but the cases where it completed games were pretty impressive to me, not just in terms of making legal moves but also reasonably good ones (maybe at the level of a beginner who's played for a few months, though it's really hard to say). Arguably, we should compare the performance here with a human playing blindfold chess (i.e. not being allowed to look at the board). In that case, it's likely better than most human beginners (who typically wouldn't be able to play 30-move blindfold games without making illegal moves).
Some more things to try
Zack’s thread and this post only scratch the surface, and I’m not-so-secretly hoping that someone else will test a lot more things, because I’m pretty curious about this but currently want to focus on other projects.
Some ideas for things to test:
Just test this in more positions and more systematically; this was all very quick-and-dirty.
Try deeper lines than 30 moves to see if it starts getting drastically worse at some point.
Give it board positions in its input and ask it for moves in those (instead of just giving it a sequence of moves). See if it’s significantly stronger in that setting (might have to try a few encodings for board positions to see which works well; see the sketch after this list for some candidates).
You could also ask it to generate the board position after each move and see if/how much that helps. (Success in these settings seems less surprising/interesting to me but might be a good comparison.)
Ask more types of questions to figure out how well it knows and understands the board position.
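For the board-encoding idea above, here is a sketch of a few candidate text encodings one could paste into a prompt (FEN, an ASCII diagram, and a piece list); the position and the prompt wording are just placeholders:

```python
# Sketch: a few candidate text encodings of a chess position for use in a prompt.
# The position here is just the starting position as a placeholder.
import chess

board = chess.Board()

fen_encoding = board.fen()
ascii_encoding = str(board)
piece_list = ", ".join(
    f"{piece.symbol()}{chess.square_name(sq)}"
    for sq, piece in sorted(board.piece_map().items())
)
side = "White" if board.turn == chess.WHITE else "Black"

prompt = (
    f"Position (FEN): {fen_encoding}\n"
    f"Position (diagram):\n{ascii_encoding}\n"
    f"Pieces: {piece_list}\n"
    f"It is {side} to move. Suggest the best move."
)
print(prompt)
```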
One of the puzzles right now is that Sydney seems better at suggesting reasonable continuation lines than at answering questions about legal moves and giving the board position. From the perspective of what’s in the training data, this makes a ton of sense. But from a mechanistic perspective, I don’t know how it works internally to produce that behavior. Is it just bad at translating its internal board representation into the formats we ask for, or at answering questions about it? Or does it use a bunch of heuristics to suggest moves that aren’t purely based on tracking the board state?
An interesting piece of existing research is this paper, where they found evidence of a learned internal Othello board representation. However, they directly trained a model from scratch to predict Othello moves (as opposed to using a pretrained LLM).
Conclusion
Sydney seems to be significantly better at chess than ChatGPT (ETA: Bucky tested ChatGPT in more detail and it turns out it’s actually also decent at completing lines with the right prompt, though not at producing FENs). Sydney does things that IMO clearly show it has at least some approximate internal representation of the board position (in particular, it can explicitly tell you approximately what the board position is after 25 moves). I was surprised by these results, though perhaps I shouldn’t have been given the Othello paper—for Sydney, chess games are only a small fraction of its training data, but on the other hand, it’s a much, much bigger model, so it can still afford to spend part of its capacity just on internal models of chess.
Also: prompting can be weird; based on very cursory experimentation, it seems that asking Sydney to use Stockfish really does help (even though it can’t actually use it and just hallucinates Stockfish’s answers).
Appendix
My test game
The move sequence I gave Sydney to arrive at my position was
Unfortunately I’m an idiot and lost some of the chat transcripts, but here are the two I still have:
Transcript 1 (suggesting lines, listing legal moves)
Transcript 2 (describing the position)
Just to be sure that the Stockfish prompt wasn’t the reason, I tried one of the exact prompts I used for Bing on ChatGPT and it failed completely, just making up a different early-game position.