A chess game against GPT-4
I just bought a subscription to access GPT-4 and played the following chess game against it, with me playing White. (No particular agenda; I was just curious how good it is.)
At this point (move 31), GPT-4 suggested Kxc4, which is not legal, and when I asked it to correct itself, it suggested Kd5 and Kb6, which aren’t legal either (the only legal move here is Kc6).
Stuff I noticed:
As was pointed out before, it’s much better than GPT3.5, which started playing illegal moves much earlier. But it still started playing illegal moves eventually, so I’m not sure if it makes sense to assign it a rating.
It missed the early “removing the defender” tactic where I could exchange my bishop for its knight, which was defending its bishop; otherwise it played OK until the end.
Moves 29 and 30 (the last two before it tried illegal moves) were just giving material away.
It explained both my moves and its own every time; those explanations started going wrong earlier than its moves did. (After it recaptured my queen on move 17, it said it had maintained material balance; after move 20 it said it had pinned my knight to a rook on c1, but there was no rook on c1; from then on, most of the explanations were wrong.)
I accidentally wrote 19. Rfd8 instead of 19. Rfd1, and it replied with “I assume you meant 19. Rfd1, placing your rook on the open d-file opposing my rook. I’ll respond with 19...e5, attacking your knight on d4 and trying to grab some space in the center.” Very helpful!
After move 14 (the first move with the black rook), I asked it to evaluate the position, and it said that White has a small advantage. But it had blundered a piece, so the position is completely winning for White (Stockfish says +5.2).
(PGN: 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 $2 Na6 $9 7. e3 c6 $6 8. Nc3 Nc7 9. Rc1 $6 b6 10. Qb3 Ba6 11. Qa4 $6 Qd7 $4 12. Bxc7 $1 Qxc7 13. Qxa6 dxc4 14. Qxc4 Rac8 15. Bd3 c5 16. O-O cxd4 17. Qxc7 Rxc7 18. Nxd4 Rd8 19. Rfd1 e5 20. Nf5 Bb4 21. Ng3 Rcd7 22. Bb5 Rxd1+ 23. Rxd1 Rxd1+ 24. Nxd1 Kf8 25. Nc3 Ke7 26. a3 Bxc3 27. bxc3 Kd6 28. Kf1 Kc5 29. c4 a6 $6 30. Bxa6 Ne4 31. Nxe4+)
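For anyone who wants to double-check the final position, here is a minimal sketch using the python-chess library (mentioned further down in the thread). It replays the game, with the NAG symbols ($2, $6, ...) stripped for brevity, and lists Black’s legal replies after 31. Nxe4+.

```python
# Replay the PGN above and list Black's legal replies in the final position
# (a sketch, assuming the python-chess package; pip install chess).
import io
import chess.pgn

pgn_text = (
    "1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 Na6 7. e3 c6 "
    "8. Nc3 Nc7 9. Rc1 b6 10. Qb3 Ba6 11. Qa4 Qd7 12. Bxc7 Qxc7 13. Qxa6 dxc4 "
    "14. Qxc4 Rac8 15. Bd3 c5 16. O-O cxd4 17. Qxc7 Rxc7 18. Nxd4 Rd8 19. Rfd1 e5 "
    "20. Nf5 Bb4 21. Ng3 Rcd7 22. Bb5 Rxd1+ 23. Rxd1 Rxd1+ 24. Nxd1 Kf8 25. Nc3 Ke7 "
    "26. a3 Bxc3 27. bxc3 Kd6 28. Kf1 Kc5 29. c4 a6 30. Bxa6 Ne4 31. Nxe4+"
)

game = chess.pgn.read_game(io.StringIO(pgn_text))
board = game.board()
for move in game.mainline_moves():
    board.push(move)

# Black to move, in check from the knight on e4; print every legal reply in SAN.
print([board.san(m) for m in board.legal_moves])  # expected: ['Kc6']
```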
I’m very eager to see its performance once we can use visual inputs and show it the board state visually after each move.
If I get early access to the visual model, I will definitely try this
Note that at least for ChatGPT (3.5), telling it to not explain anything and only output moves apparently helps. (It can play legal moves for longer that way). So that might be worth trying if you want to get better performance. Of course, giving it the board state after each move could also help but might require trying a couple different formats.
To describe the current board state, something like this seems reasonable.
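Purely as an illustration (the exact format the commenter has in mind may well differ), one option would be to append a FEN string to the prompt after each move, e.g. via python-chess:

```python
# Illustrative only: serialize the position after every move as a FEN string
# (assumes the python-chess package; not necessarily the format meant above).
import chess

board = chess.Board()
for san in ["d4", "Nf6", "c4", "e6"]:
    board.push_san(san)
    print(board.fen())  # one compact line describing the full board state
```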
I’ve created an interface for playing against LLM-powered chess agents. Here is the link: https://llmchess.org/.
I had it play hundreds of games against Stockfish, mostly at the lowest skill level, using the API. After a lot of experimentation, I settled on giving it a fresh slate with every prompt. The prompt basically told it that it was playing chess, what color it was, and the PGN so far (it did not do as well with the FEN, or with both in either order). If it made an invalid move, the next prompt(s) for that turn included a list of the invalid moves it had attempted. After a few failed tries I had it forfeit the game.
I had a system set up to rate it, but it wasn’t able to complete nearly enough games. As described, it finished maybe 1 in 40. I added a list of all legal moves on the second and third attempts for a turn; it was then able to complete about 1 in 10 and won about half of them. Counting the forfeits and calling this a legal strategy, that’s something like a 550 rating, IIRC. But it’s MUCH worse in the late middlegame and endgame, even with the fresh slate every turn. Until that point, including well past any opening book it could possibly have “lossless in its database” (not how it works), it plays much better, subjectively 1300-1400.
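A rough sketch of the kind of retry loop described above might look like the following (my own reconstruction, assuming python-chess; ask_llm_for_move is a placeholder for the API call, not the poster’s actual code):

```python
# Sketch of a per-turn retry loop: prompt with the PGN, feed back invalid
# attempts, add the legal-move list on later tries, forfeit after max_tries.
import chess

def ask_llm_for_move(prompt: str) -> str:
    """Placeholder: send the prompt to the model and return its reply (a SAN move)."""
    raise NotImplementedError

def get_move(board: chess.Board, pgn_so_far: str, color: str, max_tries: int = 3):
    invalid_tries = []
    for attempt in range(max_tries):
        prompt = (
            f"You are playing chess as {color}. The game so far:\n{pgn_so_far}\n"
            "Reply with your next move in SAN and nothing else.\n"
        )
        if invalid_tries:
            prompt += "These moves are not legal here: " + ", ".join(invalid_tries) + "\n"
        if attempt >= 1:  # second and third attempts also list every legal move
            prompt += "Legal moves: " + ", ".join(board.san(m) for m in board.legal_moves) + "\n"
        reply = ask_llm_for_move(prompt).strip()
        try:
            return board.parse_san(reply)  # raises ValueError if illegal or unparsable
        except ValueError:
            invalid_tries.append(reply)
    return None  # no legal move produced: treat as a forfeit
```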
That is odd. I certainly had a much, much higher completion rate than 1 in 40; in fact I had no games that I had to abandon with my prompt. However, I played manually, and played well enough that it mostly did not survive beyond move 30 (although my collection has a blindfold game that went beyond move 50), and checked at every turn that it reproduced the game history correctly, reprompting if that was not the case. Also, for GPT3.5 I supplied it with the narrative fiction that it could access Stockfish. Mentioning Stockfish might push it towards more precise play.
Trying again today, ChatGPT 3.5 via the standard chat interface did, however, seem to have a propensity to list only White’s moves in its PGN output, which is not encouraging.
For exact reproducibility, I have added a game played via the API at temperature zero to my collection and given exact information on model, prompt and temperature in the PGN:
https://lichess.org/study/ymmMxzbj/SyefzR3j
If your scripts allow testing this prompt, I’d be interested in seeing what completion rate/approximate rating relative to some low Stockfish level is achieved by chatgpt-3.5-turbo.
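The engine side of such a test might look roughly like this (a sketch, assuming python-chess and a local Stockfish binary on the PATH; the LLM side would be a prompt loop like the one sketched earlier in the thread):

```python
# Ask Stockfish, set to its lowest skill level, for a move in the current position.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0})  # lowest skill level

board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=0.1))
print(board.san(result.move))  # Stockfish's chosen move in SAN
board.push(result.move)

engine.quit()
```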
Did you and GPT4 only output the moves, or did you also output the board state after each turn?
Moves only
Here’s an example of how to play chess with GPT4, using only text in https://chat.openai.com …
does it play better / make legal moves for longer this way?
Caleb Parikh and I were curious about GPT-4’s internal models of chess as a result of this post, so we asked it some questions about the state partway through this game:
It replied:
(and explained that lowercase letters were black pieces and uppercase letters were white pieces, which I didn’t know; I don’t play chess).
This… is not an accurate picture of the game board (what are all those pawns doing on Black’s back row?) We also asked it for a list of legal moves that White could make next, and it described some accurately but some inaccurately (e.g. listed one as a capture even though it wasn’t).
This is pretty funny because the supposed board state has only 7 columns. Yet it’s also much better than random. A lot of the pieces are correct… that is, if you count from the left (real board state is here).
Also, I’ve never heard of using upper and lower case to differentiate White and Black; I think GPT-4 just made that up. (Edit: or not; see reply.) Extra twist: I just asked a new GPT-4 instance whether any chess notation differentiates lower and upper case, and it told me algebraic notation does, but that’s the standard notation, and it doesn’t. The Wikipedia article also says nothing about it. Very odd.
No, this is common. E.g. https://github.com/niklasf/python-chess
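For example, python-chess’s board printout and FEN output both use this convention (uppercase for White, lowercase for Black):

```python
# python-chess prints White's pieces in uppercase and Black's in lowercase,
# as does the FEN string itself.
import chess

board = chess.Board()
print(board)        # ranks 8 down to 1; '.' marks an empty square
print(board.fen())  # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
```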
Hah, I didn’t even notice that.
XD
On a retry, it chose not to summarize the board and successfully listed a bunch of legal moves for White to make. Although I asked for all legal moves, the list wasn’t exhaustive; upon prompting about this, it apologized and listed a few more moves, some of which were legal and some of which were illegal, and the list was still not exhaustive.
I was just talking with Bing about how quickly transformer AI might surpass human intelligence, and it was a sensible conversation until it hallucinated a nonexistent study in which GPT-4 was tested on 100 scenarios and dilemmas and performed badly.
What these interactions have in common is that GPT-4 does well for a while, then goes off the rails. It makes me curious about the probability of going wrong: is there a constant risk per unit time, or does the risk per unit time actually increase with the length of the interaction, and if so, why?
The probability of going wrong increases as the novelty of the situation increases. As the chess game is played, the probability that the game is completely novel or literally never played before increases. Even more so at the amateur level. If a Grandmaster played GPT3/4, it’s going to go for much longer without going off the rails, simply because the first 20 something moves are likely played many times before and have been directly trained on.
Right, though 20 moves until a new game is very rare AFAIK (assuming the regular way of counting, where 1 move means one from both sides), while 15 is commonplace. According to chess.com (which I think only includes top games, though I’m not sure), this one was new from White’s move 6 onward.
How did you prompt GPT4?
If you mean how I accessed it at all, I used the official channel from OpenAI: https://chat.openai.com/chat
If you have a premium account ($20/month), you can switch to GPT-4 after starting a new chat.
I think with the right prompting, it is around 1400 Elo, at least against strong opponents. Note, however, that this is based on a small sample; on the flip side, all my test games (against myself and three relatively weak computer opponents, with the strongest computer opponent tried being fairly strong club player level) are in a lichess study linked to from here:
https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/gpt-4?commentId=TaaAtoM4ahkfc37dR
The prompting used is heavily inspired by Bucky’s comments from the Sydney-and-chess thread. I haven’t optimised it for GPT-4 in any way.
I also tested if GPT-4 can play a game taking queen odds against an opponent that is strong compared to most humans (Leela Chess Zero at a few nodes per move). This was the case, with GPT-4 winning. However, I haven’t documented that game.
It is much weaker at commenting than at playing under these conditions. However, it does know when its position is very bad, as I have seen it resign at a late but reasonable point when I worked the possibility of resigning into the setup prompt.
I wonder how good it would get, and how quickly, if you were to take GPT-4 and train it with self-play...