OpenAI seems to train on >1% chess
If you sample from davinci-002 at t=1 starting from “<|endoftext|>”, 4% of the completions are chess games.
Code
For babbage-002, we get 1%.
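For reference, here is a minimal sketch of how an experiment like this could be run (it is not the linked code; the chess detector, sample size, and max_tokens are my own assumptions). It samples unconditioned completions from the legacy completions endpoint, starting from "<|endoftext|>" at temperature 1 as described above, and flags completions that look like chess via a crude PGN/SAN regex heuristic.

```python
# Rough sketch of the sampling experiment (not the linked code; the chess
# detector and sample count here are assumptions, not the post's exact setup).
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Heuristic chess detector: PGN header tags, or a run of >=3 numbered SAN moves.
SAN = r'(?:[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?|O-O(?:-O)?)[+#]?'
CHESS_RE = re.compile(
    rf'\[(?:Event|White|Black|Result)\s+"|(?:\d+\.\s*{SAN}(?:\s+{SAN})?\s*){{3,}}'
)

def chess_fraction(model: str, n_samples: int = 200) -> float:
    """Sample completions starting from '<|endoftext|>' at temperature 1
    and return the fraction that look like chess games."""
    hits = 0
    for _ in range(n_samples):
        resp = client.completions.create(
            model=model,
            prompt="<|endoftext|>",
            temperature=1.0,
            max_tokens=256,
        )
        if CHESS_RE.search(resp.choices[0].text):
            hits += 1
    return hits / n_samples

print(chess_fraction("davinci-002"))   # ~0.04 reported above
print(chess_fraction("babbage-002"))   # ~0.01 reported above
```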
This probably implies that a reasonably high fraction of the total tokens in typical OpenAI training datasets are chess games!
Despite chess being 4% of sampled documents, chess is probably less than 4% of overall tokens (perhaps 1% of tokens are chess? perhaps less?). Because chess games are shorter than typical documents, chess likely makes up a higher fraction of documents than of tokens.
(To understand this, imagine that the single token “Hi” and nothing else made up half of all documents. This would still be a small fraction of tokens, since most other documents are far longer than one token.)
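To put rough numbers on this (the average lengths below are assumptions, not measurements): if chess games average ~350 tokens and other documents average ~1,500 tokens, then 4% of documents works out to roughly 1% of tokens.

```python
# Toy arithmetic with assumed average lengths (not measured values).
chess_doc_frac = 0.04   # fraction of sampled documents that are chess
chess_len = 350         # assumed average tokens per chess game
other_len = 1500        # assumed average tokens per non-chess document

chess_tokens = chess_doc_frac * chess_len
total_tokens = chess_tokens + (1 - chess_doc_frac) * other_len
print(chess_tokens / total_tokens)  # ~0.0096, i.e. about 1% of tokens
```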
(My title might be slightly clickbait: chess is likely >1% of documents, but might be <1% of tokens.)
It’s also possible that these models were fine-tuned on a data mix with more chess, while the pretraining data had less chess.
Appendix A.2 of the weak-to-strong generalization paper explicitly notes that GPT-4 was trained on at least some chess.