Both are great points, especially #1. I’ll run some experiments and report back.
That’s an interesting idea, I may test that out at some point. I’m assuming the softmax would be for kings / queens, where there is typically only one on the board, rather than for e.g. blank squares or pawns?
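To illustrate the distinction, here is a minimal sketch of what that could look like, assuming per-square probe logits of shape (64, num_piece_classes); the names, shapes, and class ordering are placeholders for illustration, not the actual probe code:

```python
import torch

# Hypothetical probe output: one logit per (square, piece class).
# 64 squares; piece classes e.g. [blank, P, N, B, R, Q, K, p, n, b, r, q, k].
square_logits = torch.randn(64, 13)

# For a piece that occurs at most once (e.g. the white king), a softmax
# *across squares* encodes the constraint "exactly one of these squares":
white_king_idx = 6  # illustrative index
king_probs = torch.softmax(square_logits[:, white_king_idx], dim=0)  # sums to 1 over the 64 squares
predicted_king_square = king_probs.argmax().item()

# For blank squares or pawns, which can appear many times, an independent
# per-square softmax over piece classes is the more natural choice:
per_square_probs = torch.softmax(square_logits, dim=-1)  # each row sums to 1
```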
The model trained on the all-Stockfish dataset played at a level that was 100-200 Elo higher in my tests, with a couple of caveats. First, I benchmarked the LLMs against Stockfish, so an all-Stockfish dataset is presumably helpful for this particular benchmark. Second, the Stockfish-trained LLM would probably have an advantage in robustness, because I included a small percentage of Stockfish vs. random move generator games in the Stockfish dataset in the hope of improving its robustness.
I haven’t done an in-depth qualitative assessment of their abilities, so unfortunately I can’t give a more detailed answer.
Yes, in this recent OpenAI superalignment paper they said that GPT-4's training dataset included a collection of chess games filtered for players with greater than 1800 Elo. Given gpt-3.5-turbo-instruct's ability, I'm guessing that its dataset included a similar collection.
I had the following results:
Stockfish level 2 vs. Stockfish level 0, 0.01 seconds per move, 5k games:
0 random moves: win rate 81.2%
20 random moves: win rate 81.2%
40 random moves: win rate 77.9%
The 95% confidence interval is about ±1%.

Stockfish level 15 vs. Stockfish level 9, 0.01 seconds per move, 5k games:
0 random moves: win rate 65.5%
20 random moves: win rate 72.8%
40 random moves: win rate 67.5%
Once again, the 95% confidence interval is about ±1%.
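For reference, a match like this can be set up with python-chess roughly as follows; this is a sketch rather than the script I actually ran, and the engine path and settings are placeholders:

```python
import random
import chess
import chess.engine

ENGINE_PATH = "/usr/local/bin/stockfish"  # placeholder path

def play_game(level_white, level_black, n_random_moves, move_time=0.01):
    """Play one game between two Stockfish instances at fixed skill levels,
    after an opening of n_random_moves uniformly random legal moves."""
    board = chess.Board()
    for _ in range(n_random_moves):
        if board.is_game_over():
            break
        board.push(random.choice(list(board.legal_moves)))

    white = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    black = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    white.configure({"Skill Level": level_white})
    black.configure({"Skill Level": level_black})
    try:
        while not board.is_game_over():
            engine = white if board.turn == chess.WHITE else black
            result = engine.play(board, chess.engine.Limit(time=move_time))
            board.push(result.move)
        return board.result()  # "1-0", "0-1", or "1/2-1/2"
    finally:
        white.quit()
        black.quit()

# e.g. level 2 (White) vs. level 0 (Black) with 20 random opening moves
print(play_game(2, 0, 20))
```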
At 120 seconds per move, both of these level differences correspond to ~300 Elo: https://github.com/official-stockfish/Stockfish/commit/a08b8d4
My tests used 0.01 seconds per move. It appears that less search time shrinks the Elo difference between level 15 and level 9. A 65% win rate corresponds to a ~100 Elo difference, while an 81% win rate corresponds to a 250-300 Elo difference.
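Those conversions follow from the standard Elo expected-score formula, treating the win rate as the expected score and ignoring draws (a simplification):

```python
import math

def elo_diff(win_rate):
    """Elo difference implied by an expected score / win rate (draws ignored)."""
    return -400 * math.log10(1 / win_rate - 1)

print(round(elo_diff(0.655)))  # ~111 Elo
print(round(elo_diff(0.812)))  # ~254 Elo
```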
Honestly, I'm not too sure what to make of the results. One possible confounder is that in every case the higher-level player is White, and starting a game from a randomized position may favor the side that moves first. Level 2 vs. level 0 seems most applicable to the Chess-GPT setting.