OpenAI seems to train on >1% chess
If you sample from davinci-002 at t=1 starting from “<|endoftext|>”, 4% of the completions are chess games.
Code
For babbage-002, we get 1%.
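For reference, here is a minimal sketch of how an experiment like this could be run (it is not the linked code; the chess detector, sample size, and max_tokens are my own assumptions). It samples unconditioned completions from the legacy completions endpoint, starting from "<|endoftext|>" at temperature 1 as described above, and flags completions that look like chess via a crude PGN/SAN regex heuristic.

```python
# Rough sketch of the sampling experiment (not the linked code; the chess
# detector and sample count here are assumptions, not the post's exact setup).
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Heuristic chess detector: PGN header tags, or a run of >=3 numbered SAN moves.
SAN = r'(?:[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?|O-O(?:-O)?)[+#]?'
CHESS_RE = re.compile(
    rf'\[(?:Event|White|Black|Result)\s+"|(?:\d+\.\s*{SAN}(?:\s+{SAN})?\s*){{3,}}'
)

def chess_fraction(model: str, n_samples: int = 200) -> float:
    """Sample completions starting from '<|endoftext|>' at temperature 1
    and return the fraction that look like chess games."""
    hits = 0
    for _ in range(n_samples):
        resp = client.completions.create(
            model=model,
            prompt="<|endoftext|>",
            temperature=1.0,
            max_tokens=256,
        )
        if CHESS_RE.search(resp.choices[0].text):
            hits += 1
    return hits / n_samples

print(chess_fraction("davinci-002"))   # ~0.04 reported above
print(chess_fraction("babbage-002"))   # ~0.01 reported above
```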
This probably implies that a reasonably high fraction of the total tokens in typical OpenAI training datasets are chess games!
Despite chess being 4% of sampled documents, chess is probably less than 4% of overall tokens (perhaps 1% of tokens are chess? perhaps less?). Because chess games are shorter than typical documents, chess likely makes up a higher fraction of documents than of tokens.
(To understand this, imagine that the single token “Hi” and nothing else made up half of all documents. This would still be a small fraction of tokens, since most other documents are far longer than one token.)
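To put rough numbers on this (the average lengths below are assumptions, not measurements): if chess games average ~350 tokens and other documents average ~1,500 tokens, then 4% of documents works out to roughly 1% of tokens.

```python
# Toy arithmetic with assumed average lengths (not measured values).
chess_doc_frac = 0.04   # fraction of sampled documents that are chess
chess_len = 350         # assumed average tokens per chess game
other_len = 1500        # assumed average tokens per non-chess document

chess_tokens = chess_doc_frac * chess_len
total_tokens = chess_tokens + (1 - chess_doc_frac) * other_len
print(chess_tokens / total_tokens)  # ~0.0096, i.e. about 1% of tokens
```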
(My title might be slightly clickbait: chess is likely >1% of documents, but might be <1% of tokens.)
It’s also possible that these models were fine-tuned on a data mix with more chess, while the pretraining data had less chess.
Appendix A.2 of the weak-to-strong generalization paper explicitly notes that GPT-4 was trained on at least some chess.