I had it play hundreds of games against Stockfish, mostly at its lowest skill level, using the API. After a lot of experimentation, I settled on giving it a fresh slate every prompt. The prompt basically told it that it was playing chess, what color it was playing, and gave the PGN (it did not do as well with the FEN, or with both in either order). If it made invalid moves, the next prompt(s) for that turn included a list of the invalid moves it had attempted. After a few failed tries I had it forfeit the game.
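For anyone wanting to replicate this, the loop is straightforward to sketch. Here is a minimal illustration using python-chess, with a placeholder ask_model() standing in for the real API call; the retry limit, prompt wording, and engine time limit are my assumptions, not the original setup:

```python
import chess
import chess.engine
import chess.pgn

MAX_TRIES = 3  # assumption: "a few tries" before forfeiting

def ask_model(prompt: str) -> str:
    """Stand-in for the actual API call; should return a SAN move."""
    raise NotImplementedError

def pgn_so_far(board: chess.Board) -> str:
    # Render the moves played so far as PGN movetext, e.g. "1. e4 e5 2. Nf3".
    return str(chess.pgn.Game.from_board(board).mainline())

def model_move(board: chess.Board, color: str) -> chess.Move | None:
    """Ask for a move with a fresh-slate prompt; None means forfeit."""
    invalid: list[str] = []
    for _ in range(MAX_TRIES):
        prompt = (f"You are playing chess as {color}. "
                  f"Game so far: {pgn_so_far(board)}. "
                  "Reply with your next move in SAN.")
        if invalid:
            prompt += " Invalid moves you already tried: " + ", ".join(invalid)
        reply = ask_model(prompt).strip()
        try:
            return board.parse_san(reply)  # raises ValueError if illegal
        except ValueError:
            invalid.append(reply)
    return None

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0})  # lowest skill level
board = chess.Board()
while not board.is_game_over():
    if board.turn == chess.WHITE:  # model plays White in this sketch
        move = model_move(board, "White")
        if move is None:
            break  # forfeit after MAX_TRIES invalid attempts
        board.push(move)
    else:
        board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
engine.quit()
```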
I had a system set up to rate it, but it wasn't able to complete nearly enough games. As described, it finished maybe 1 in 40. So I added a list of all legal moves to the second and third attempts for a turn. It was then able to complete about 1 in 10 games and won about half of them. Counting the forfeits as losses and calling forfeiting a legal strategy, that works out to something like a 550 rating, IIRC. But it's MUCH worse in the late middlegame and endgame, even with the fresh slate every turn. Until that point, including well past any opening book it could possibly have "lossless in its database" (not how it works), it plays much better, subjectively 1300-1400.
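Concretely, that augmentation is a small addition to the retry branch of the sketch above (again, the wording is illustrative):

```python
# On the second and third attempts, also list every legal move in SAN.
if invalid:
    legal = ", ".join(board.san(m) for m in board.legal_moves)
    prompt += f" Legal moves: {legal}."
```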
That is odd. I certainly had a much, much higher completion rate than 1 in 40; in fact, with my prompt I had no games that I had to abandon. However, I played manually, and played well enough that it mostly did not survive beyond move 30 (although my collection has a blindfold game that went beyond move 50), and I checked at every turn that it reproduced the game history correctly, reprompting if that was not the case. Also, for GPT-3.5 I supplied it with the narrative fiction that it could access Stockfish; mentioning Stockfish might push it towards more precise play.
Trying again today, however, ChatGPT 3.5 in the standard chat interface seemed to have a propensity for listing only White's moves in its PGN output, which is not encouraging.
For exact reproducibility, I have added a game played via the API at temperature zero to my collection, with the exact model, prompt, and temperature recorded in the PGN:
https://lichess.org/study/ymmMxzbj/SyefzR3j
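(PGN's tag-pair syntax makes this kind of record easy; something along these lines, where the tag names and values are my illustration rather than the actual headers of the linked study:)

```
[White "gpt-3.5-turbo, temperature=0"]
[Annotator "Prompt: <exact prompt text>"]
```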
If your scripts allow testing this prompt, I'd be interested in seeing what completion rate and approximate rating relative to some low Stockfish level gpt-3.5-turbo achieves.