That sounds not-so-impressive, until you consider that it’s effectively playing blindfolded, having access to only the game’s moves in algebraic notation, and not a visual of a chessboard.
We collect chess game data from a one-month dump of the Lichess dataset, deliberately distinct from the month used in our own Lichess dataset. We design several model-based tasks including converting PGN to FEN, transferring UCI to FEN, and predicting legal moves, etc., resulting in 1.9M data samples.
Why not just give it access to a visual of a chessboard?
Hardly anyone seems to have access to the image-enabled GPT-4 yet, and thus far GPT-4 has not done well on things like text art of Tic-tac-toe or Sudoku, so it doesn’t look like you can just ASCII-art your way out of problems like this yet.
I was a little surprised the ASCII-art approach didn’t work, because my original guess for the multimodal GPT-4 training had been that it was iGPT/DALL-E-1 style: just serializing images into visual tokens to train autoregressively on as a prefix. However, the GPT-4 architecture leak claims that it’s implemented in a completely different way: as a separate model which does cross-attention to GPT-4. So that helps deal with the problem of the visual tokens losing detail while using up a lot of context, but is also not something that necessarily would instill a lot of visual knowledge into the text-only model. (The text-only model might even be completely frozen and untrained, like Flamingo etc.)
Anyway, as I’ve been suggesting since 2019 or so, it’d probably be useful to train some small chess Transformers directly on both PGN and FEN (and mixtures thereof, like a FEN every n PGN moves) to help elucidate Transformer world-modeling. Any GPT-n result is as much a result about the proprietary internal OA datasets (and effects of BPEs) as it is about Transformer world-modeling or capabilities.
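To make the "FEN every n PGN moves" mixture concrete, here is a minimal sketch of how one training string could be assembled. It assumes the per-ply FENs have already been computed by some chess library; the function name interleave and the choice to wrap each FEN in a PGN-style brace comment are my own illustration, not something from any existing dataset.

```python
def interleave(moves, fens, n):
    """Build one training string: moves with a FEN snapshot after every n plies.

    moves: list of move strings (one per ply), e.g. ["e4", "e5", "Nf3"]
    fens:  fens[i] is the position reached after moves[i] has been played
    n:     insert a snapshot after every n-th ply
    """
    if len(moves) != len(fens):
        raise ValueError("need exactly one FEN per ply")
    if n < 1:
        raise ValueError("n must be at least 1")
    parts = []
    for ply, move in enumerate(moves, start=1):
        parts.append(move)
        if ply % n == 0:
            # PGN comments live in braces, so the result is still PGN-like text.
            parts.append("{" + fens[ply - 1] + "}")
    return " ".join(parts)


if __name__ == "__main__":
    moves = ["e4", "e5", "Nf3"]
    fens = [
        "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
        "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2",
        "rnbqkbnr/pppp1ppp/8/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2",
    ]
    print(interleave(moves, fens, n=2))
```

Setting n=1 gives a FEN after every ply, and a very large n degenerates to plain move lists, so a single parameter sweeps between the two extremes.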
The ChessGPT paper does something like that: https://arxiv.org/abs/2306.09200
I hadn’t seen that recent paper, thanks.
Looks like they do a lot of things with FEN and train on a corpus which includes some FEN<->move-based-representation tasks, but they don’t really do anything which directly bears on this, aside from showing that their chess GPTs can turn UCI/sPGNs into FENs with high accuracy. That would seem to imply that they have learned a good model of chess, because otherwise how could they take a move-by-move description (like UCI/PGN) and print out an accurate summary of the final board state (like FEN)?
I would’ve preferred to see them do something like train FEN-only vs PGN-only chess GPTs (where the games are identical, just the encoding is different) and demonstrate the former is better, where ‘better’ means something like having a higher Elo or not getting worse towards the end of the game.
I did train a transformer to predict moves from board positions (not strictly FEN because with FEN positional encodings don’t point to the same squares consistently). Maybe I’ll get around to letting it compete against the different GPTs.
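For anyone wondering what the FEN problem is here: FEN run-length encodes empty squares, so the same character offset lands on different squares in different positions. One possible fix (a sketch, not necessarily what the commenter above did) is to expand the piece-placement field to a fixed 64 characters. The name fen_to_squares and the '.' filler are my own choices.

```python
def fen_to_squares(fen):
    """Expand the piece-placement field of a FEN into exactly 64 characters.

    Index 0 is a8, index 7 is h8, index 63 is h1 (FEN's own rank order).
    Empty squares become '.', so a given index always means the same square.
    """
    placement = fen.split()[0]
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("expected 8 ranks in the piece-placement field")
    rows = []
    for rank in ranks:
        # A digit d stands for d consecutive empty squares.
        row = "".join("." * int(ch) if ch.isdigit() else ch for ch in rank)
        if len(row) != 8:
            raise ValueError("rank does not describe 8 squares: " + rank)
        rows.append(row)
    return "".join(rows)


if __name__ == "__main__":
    after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
    squares = fen_to_squares(after_e4)
    print(len(squares), squares[36])  # index 36 is e4
```

The cost is a longer input (always 64 symbols for the board), in exchange for a fixed square-to-position mapping.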
Something I have always wanted to try is training a super tiny transformer on Tic Tac Toe, to see what it comes up with and how many games it needs to generalise.
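The data side of that experiment is small enough to enumerate exhaustively. Below is a rough sketch that generates every legal Tic Tac Toe game (X first, stopping at a win or a full board), which comes to 255,168 games. The serialization in to_text (cell indices 0-8 separated by spaces) is just one assumed encoding; the training itself is left out.

```python
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def all_games():
    """Yield every legal game as a tuple of cell indices (0-8), X moving first."""
    def rec(board, moves, player):
        for cell in range(9):
            if board[cell] != ".":
                continue
            board[cell] = player
            moves.append(cell)
            if winner(board) or len(moves) == 9:
                yield tuple(moves)  # game over: win or full board
            else:
                yield from rec(board, moves, "O" if player == "X" else "X")
            moves.pop()
            board[cell] = "."
    yield from rec(["."] * 9, [], "X")


def to_text(game):
    """Serialize a game as space-separated cell indices."""
    return " ".join(str(cell) for cell in game)


if __name__ == "__main__":
    games = list(all_games())
    print(len(games), to_text(games[0]))
```

Holding out a random fraction of these games and varying the training-set size would be one way to measure how many games are needed to generalise.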
GPT-3.5 isn’t multimodal, so can’t really do that; I do wonder whether it would make GPT-4’s performance even better, though.
That said, this being a text-only model, really the only relevant information that would improve the situation is a freeze frame of the current state of the chessboard, expressed in any way—visuals just happen to work best for us, but GPT’s natural domain is the written word. So the correct test would probably be to replace the score (which requires computation to reconstruct the board state from scratch) with some kind of notation to instead represent the current board, for example by using Forsyth-Edwards Notation. I’d like to see if that makes it play well for longer (also, it shortens the prompts, so it should avoid running out of context window).
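To illustrate the "computation to reconstruct the board state" point, here is a minimal sketch that replays a list of moves and emits the piece-placement field of a FEN. Assumptions, loudly: moves are given in UCI form (e.g. e2e4, e7e8q) rather than the SAN of a normal score, no legality checking is done, and only the first FEN field is produced (no side to move, castling rights, en passant square, or counters). All function names are mine.

```python
START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"


def parse_placement(placement):
    """FEN piece-placement field -> 8x8 list of lists (row 0 is rank 8)."""
    return [
        list("".join("." * int(c) if c.isdigit() else c for c in rank))
        for rank in placement.split("/")
    ]


def to_placement(board):
    """Inverse of parse_placement: run-length encode the empty squares."""
    ranks = []
    for row in board:
        out, empty = "", 0
        for sq in row:
            if sq == ".":
                empty += 1
            else:
                out += (str(empty) if empty else "") + sq
                empty = 0
        ranks.append(out + (str(empty) if empty else ""))
    return "/".join(ranks)


def square(name):
    """'e4' -> (row, col), with row 0 = rank 8 and col 0 = file a."""
    return 8 - int(name[1]), ord(name[0]) - ord("a")


def apply_uci(board, move):
    """Apply one UCI move in place. No legality checking is done."""
    (r1, c1), (r2, c2) = square(move[0:2]), square(move[2:4])
    piece = board[r1][c1]
    # En passant: a pawn moves diagonally onto an empty square.
    if piece in "Pp" and c1 != c2 and board[r2][c2] == ".":
        board[r1][c2] = "."
    # Castling: the king moves two files, so the rook has to move too.
    if piece in "Kk" and abs(c2 - c1) == 2:
        rook_from, rook_to = (7, 5) if c2 > c1 else (0, 3)
        board[r1][rook_to] = board[r1][rook_from]
        board[r1][rook_from] = "."
    # Promotion: a fifth character names the new piece.
    if len(move) == 5:
        piece = move[4].upper() if piece == "P" else move[4].lower()
    board[r1][c1] = "."
    board[r2][c2] = piece


def placement_after(moves):
    """Replay UCI moves from the starting position."""
    board = parse_placement(START)
    for move in moves:
        apply_uci(board, move)
    return to_placement(board)


if __name__ == "__main__":
    print(placement_after(["e2e4", "e7e5", "g1f3"]))
```

Even this stripped-down version needs special cases for en passant, castling, and promotion, which gives some feel for what a model has to do implicitly when it only sees the move list.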
FEN is definitely an option. By “visual”, what I had in mind would be something like assembling an 8 by 8 grid of characters using, e.g., https://en.m.wikipedia.org/wiki/Chess_symbols_in_Unicode
What I’m wondering is why people don’t do this.
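For concreteness, a grid like that can be produced from a FEN in a few lines. This is only a sketch of one possible layout: the function name fen_to_unicode_grid, the '.' for empty squares, and the single-space separator are arbitrary choices of mine, and how any of this tokenizes for a given model is a separate question.

```python
# Code points from the Unicode "Miscellaneous Symbols" block (U+2654..U+265F).
UNICODE_PIECES = {
    "K": "\u2654", "Q": "\u2655", "R": "\u2656",
    "B": "\u2657", "N": "\u2658", "P": "\u2659",
    "k": "\u265A", "q": "\u265B", "r": "\u265C",
    "b": "\u265D", "n": "\u265E", "p": "\u265F",
}


def fen_to_unicode_grid(fen, empty="."):
    """Render the piece-placement field of a FEN as 8 lines of 8 symbols.

    The first line is rank 8 (Black's back rank), matching FEN order.
    """
    rows = []
    for rank in fen.split()[0].split("/"):
        cells = []
        for ch in rank:
            if ch.isdigit():
                cells.extend([empty] * int(ch))  # digit = run of empty squares
            else:
                cells.append(UNICODE_PIECES[ch])
        rows.append(" ".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
    print(fen_to_unicode_grid(start))
```

Feeding the same positions once as FEN and once as this grid would give the side-by-side comparison of the two representations.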
FEN is essentially the same thing as that, but better. Try to “think as a GPT”: if you’re a fundamentally textual mind, then a no-frills, standardized representation that compresses all required information in a few well-known tokens will be ideal. With a custom representation it might instead have to learn it, the Unicode chess symbols may be unusual tokens, and any added tabulation or decoration is more of a source of confusion than anything else. It improves clarity for us humans, because we’re highly visual beings, not necessarily for a transformer. Text does that a lot more straightforwardly, and if it’s something that is likely to have appeared a lot in the training set, all the better.
I get the argument, but I’m not sure it’s true. There might be enough Unicode chessboards on the internet that it has learned the basics of the representation, and it might be able to transfer-learn some strategies it sees in other notations to become good at Unicode chessboards, and a transformer might be able to exploit the geometry of the chessboard. Not sure.
Either FEN or a Unicode chessboard could be interesting. Comparing both could be interesting too.
It’s a good thought, and I had the same one a while ago, but I think dr_s is right here; FEN isn’t helpful to GPT-3.5 because it hasn’t seen many FENs in its training, and it just tends to bungle it.
Lichess study, ChatGPT conversation link
GPT-3.5 has trouble from the start maintaining a correct FEN, and makes its first illegal move on move 7, and starts making many illegal moves around move 13.
Apparently it also bungles the Unicode representation: https://chat.openai.com/share/10b8b0d3-7c80-427a-aaf7-ea370f3a471b
Ah, dang it. So it’s damned if you do, damned if you don’t: it has seen lots of scores, but they’re computationally difficult to keep track of since they’re basically “diffs” of the board state. But there’s not enough FEN or other board notation going around for it to have learned to use that reliably. It cuts at the heart of one of the key things that hold GPT back from generality: it seems like it needs to learn each thing separately, and doesn’t transfer skills that well. If not for this, honestly, I’d call it AGI already in terms of the sheer scope of the things it can do.