Some other order-of-magnitude estimates of the available data, assuming words roughly equal tokens:
Wikipedia: 4B English words, according to this page.
Library of Congress: from this footnote I assume there are at most 100 million books' worth of text in the LoC, and from this page I assume a book is ~100k words, giving 10T words at most.
Constant writing: I estimate that a typical person writes at most 1000 words per day, with maybe 100 million people writing this amount of English on the internet. Over the last 10 years, these writers would have produced 370T words.
Research papers: this page estimates ~4m papers are published each year; at 10k words per paper and 100 years of research, this amounts to 4T words total.
So it looks like 10T words is an optimistic order-of-magnitude estimate of the total amount of data available.
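As a sanity check on the arithmetic above, here is the same back-of-the-envelope calculation in code (all the inputs are just the rough assumptions stated in the list, not measured values):

```python
# Back-of-the-envelope word counts, assuming 1 word ~ 1 token.
wikipedia = 4e9                     # ~4B English words

loc_books = 100e6                   # at most ~100M books' worth of text in the LoC
words_per_book = 100e3              # ~100k words per book
library_of_congress = loc_books * words_per_book        # 1e13 = 10T words

writers = 100e6                     # ~100M people writing English online
words_per_day = 1000                # at most ~1k words per day each
constant_writing = writers * words_per_day * 365 * 10   # ~3.7e14 = 370T words over 10 years

papers_per_year = 4e6               # ~4M papers per year
words_per_paper = 10e3              # ~10k words per paper
research = papers_per_year * words_per_paper * 100      # 4e12 = 4T words over 100 years

for name, total in [("Wikipedia", wikipedia),
                    ("Library of Congress", library_of_congress),
                    ("Constant writing", constant_writing),
                    ("Research papers", research)]:
    print(f"{name}: {total:.1e} words")
```

Running it gives 4e9, 1e13, ~3.7e14, and 4e12 words respectively, matching the figures above.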
I assume the importance of a large quantity of clean text data will lead to the construction of a curated text dataset of ~1T tokens, and that this dataset (or models trained on it) will eventually be open-sourced.
From there, it seems like really digging into the sources of irreducible error will be necessary for further scaling. I would guess that a small part of this is “method error” (training details, context window, etc.), but that a significant fraction comes from intrinsic text entropy. Some entropy has to be present, or else text would have no information value.
I would guess that this irreducible error can be broken down into:
Uncertainty about the specific type of text the model is trying to predict (e.g. it needs some data to figure out that it’s supposed to write in modern English, then more data to learn that the writing is flamboyant/emotional, then more to learn that there is a narrative structure, then more to determine that it is a work of fiction, etc.). The model will always need some data to specify which text-generating sub-model to use. This error can be reduced with better prompts (though not completely eliminated).
Uncertainty about location within the text. For example, even if the model had memorized a specific play by Shakespeare, if you asked it to do next-word prediction on a random paragraph from the text, it would have trouble predicting the first few words simply because it hasn’t yet determined which paragraph it has been given. This error should go away once the model has been fed enough of the passage. Better prompts and a larger context window should help.
Uncertainty inherent to the text. This relates to the actual information content of the text, and should be irreducible. I’m not sure about the relative size of this uncertainty compared to the other ones, but this paper suggests an entropy of ~10 bits/word in English (which seems high?). I don’t know exactly how entropy translates into training loss for these models, but a rough conversion is sketched below. Memorization of key facts (or database access) can reduce the average information content of a text.
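To get a feel for how that 10 bits/word figure compares to reported model losses (my own rough conversion, assuming 1 word ≈ 1 token and loss measured in nats per token):

$$10\ \text{bits/word} \times \ln 2 \approx 6.9\ \text{nats/token},$$

which is a few times higher than the roughly 2 to 3 nats/token that large language models reach on held-out text, so either the 10 bits/word estimate is high or a word corresponds to several tokens.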
EDIT: also note that going from 10T to 100T tokens would only reduce the loss by 0.045, so it may not be worthwhile to increase dataset size beyond the 10T order of magnitude.
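To spell out where a number of that size comes from (a sketch of the general shape, not the exact calculation behind the 0.045 figure): with a data-scaling fit of the form $L(D) = (D_c/D)^{\alpha_D}$ and the Kaplan et al. exponent $\alpha_D \approx 0.095$, a 10x increase in data removes a fixed fraction of the data-limited loss:

$$\frac{L(10D)}{L(D)} = 10^{-\alpha_D} \approx 0.80, \qquad \Delta L = L(D)\left(1 - 10^{-\alpha_D}\right) \approx 0.2\,L(D),$$

so the absolute gain from 10T to 100T tokens is small whenever the data-limited term is already small at 10T.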
I think the models are evaluated on inputs that fill their whole context window, i.e. ~1024 tokens long. I doubt there are many passages in Shakespeare’s plays where the same 1024 tokens are repeated.
Oh I didn’t realize! Thanks for clarifying. Uncertainty about location probably doesn’t contribute much to the loss then.