The main result is that up to 4 repetitions of the data are about as good as the same amount of unique data,
and for up to about 16 repetitions there is still meaningful improvement.
Let’s take 50T tokens as an estimate for the available text data
(as an anchor, RedPajama-Data-v2, a filtered and deduplicated
CommonCrawl dataset, has 30T tokens).
Repeated 4 times, that can make good use of about 1e28 FLOPs (with a dense transformer),
and repeated 16 times, suboptimal but still meaningful use of about 2e29 FLOPs (rough arithmetic sketched below).
So this is close to, but not below, what can be put to use within a few years.
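
For concreteness, here is a minimal sketch of the arithmetic behind those FLOP figures, assuming the usual Chinchilla-style approximations for a dense transformer (compute ≈ 6·N·D, and a compute-optimal budget of roughly 20 tokens per parameter); the constants and names here are illustrative assumptions, not values from the paper:

```python
# Minimal sketch, assuming Chinchilla-style scaling for a dense transformer:
# compute C ~= 6 * N * D (N = parameters, D = training tokens) and a
# compute-optimal token budget of D ~= 20 * N. Constants are approximations.

def flops_for_token_budget(tokens: float, tokens_per_param: float = 20.0) -> float:
    """FLOPs a compute-optimal dense transformer can usefully spend on `tokens` tokens."""
    params = tokens / tokens_per_param  # compute-optimal model size for this budget
    return 6.0 * params * tokens        # C ~= 6 * N * D

unique_tokens = 50e12  # 50T tokens of available text data

for repetitions in (4, 16):
    effective_tokens = unique_tokens * repetitions
    flops = flops_for_token_budget(effective_tokens)
    print(f"{repetitions:>2}x: {effective_tokens:.0e} tokens -> ~{flops:.1e} FLOPs")
# ~1.2e28 FLOPs at 4x repetition, ~1.9e29 FLOPs at 16x
```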
Thanks for pushing back on the original claim.
I’ve now changed my mind based on
N. Muennighoff et al. (2023), Scaling Data-Constrained Language Models.