johnswentworth comments on johnswentworth’s Shortform

johnswentworth 15 Nov 2024 21:20 UTC
2 points
0
I don’t expect that to be particularly relevant. The data wall is still there; scaling just compute has considerably worse returns than the curves we’ve been on for the past few years, and we’re not expecting synthetic data to be anywhere near sufficient to bring us close to the old curves.
- Vladimir_Nesov 15 Nov 2024 22:24 UTC
  5 points
  0
  Nobody has admitted to trying repeated data at scale yet (so we don’t know that it doesn’t work). In the tiny experiments, repetition can 5x the data with little penalty and 15x the data in a still-useful way. It’s not yet relevant for large models, but it might turn out that small models would already benefit greatly.
  
  There are 15-20T tokens in the datasets whose size is disclosed for current models (Llama 3, Qwen 2.5), and plausibly 50T tokens of tolerable quality can be found (pretraining only needs to create useful features, not relevant behaviors). With 5x 50T tokens, even at 80 tokens/parameter^[1] we can make good use of 5e27-7e27 FLOPs^[2], which even a 1 gigawatt system of 500K B200s in early 2026 would need 4-6 months to provide.
  
  The isoFLOP plots (varying tokens per parameter for fixed compute) seem to show loss/perplexity basins that are quite wide once they reach about 1e20 FLOPs of compute. The basins also get wider for hybrid attention (compare 100% Attention isoFLOPs in the “Perplexity scaling analysis” Figure to the others). So it’s likely that using a slightly suboptimal tokens/parameter ratio of, say, 40 won’t hurt performance much at all. In that case we get to use 9e27-2e28 FLOPs by training a larger model on the same 5x 50T tokens dataset. The data wall for text data is unlikely to be a 2024-2026 issue.
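The compute ranges quoted above can be reproduced with a short back-of-envelope sketch. The token counts, tokens/parameter ratios, and the 6ND and 9.6ND coefficients come from the comment and its footnotes. The per-GPU throughput and utilization used for the time estimate are assumptions chosen for illustration, not figures from the comment.

```python
# Back-of-envelope check of the compute figures quoted in the comment.
UNIQUE_TOKENS = 50e12   # "plausibly 50T tokens of tolerable quality"
REPEATS = 5             # "5x the data with little penalty"
COEFFS = {"dense, 6ND": 6.0, "MoE, 9.6ND": 9.6}  # FLOPs per parameter per token

# ASSUMPTIONS (not stated in the comment), used only for the time estimate:
B200_PEAK_FLOPS = 2.25e15   # assumed dense BF16 peak per GPU, FLOP/s
UTILIZATION = 0.40          # assumed fraction of peak achieved in training
NUM_GPUS = 500_000          # "500K B200s"
SECONDS_PER_MONTH = 30 * 24 * 3600


def training_flops(tokens_per_param, coeff):
    """Training compute C = coeff * N * D for a given tokens/parameter ratio."""
    d = UNIQUE_TOKENS * REPEATS   # tokens seen, counting repetition
    n = d / tokens_per_param      # parameter count implied by the ratio
    return coeff * n * d


def months_on_cluster(flops):
    """Wall-clock months to deliver `flops` on the assumed cluster."""
    rate = NUM_GPUS * B200_PEAK_FLOPS * UTILIZATION
    return flops / rate / SECONDS_PER_MONTH


for ratio in (80, 40):
    for label, coeff in COEFFS.items():
        c = training_flops(ratio, coeff)
        print(f"{ratio} tokens/param, {label}: {c:.2e} FLOPs, "
              f"{months_on_cluster(c):.1f} months")
```

At 80 tokens/parameter this gives about 4.7e27 and 7.5e27 FLOPs, and at 40 tokens/parameter about 9.4e27 and 1.5e28 FLOPs, in line with the rounded ranges in the text. The month figures move directly with the assumed throughput and utilization, so they are only a rough consistency check on the 4-6 month claim.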
  ^[1] Conservatively asking for much more data than Chinchilla’s 20 tokens per parameter, in light of the range of results in more recent experiments and adding some penalty for repetition of data. For example, Llama 3 had 40 tokens per parameter estimated as optimal for 4e25 FLOPs from isoFLOPs for smaller runs (up to 1e22 FLOPs, Figure 2), and linear extrapolation in log-coordinates (Figure 3) predicts that this value slowly increases with compute. But other experiments have it decreasing with compute, so this is unclear.

  ^[2] The usual estimate for training compute of a dense transformer is 6ND, but a recent Tencent paper estimates 9.6ND for their MoE model (Section 2.3.1).
  - johnswentworth 15 Nov 2024 23:11 UTC
    2 points
    −1
    FYI, my update from this comment was:
    Hmm, seems like a decent argument...
    … except he said “we don’t know that it doesn’t work”, which is an extremely strong update that it will clearly not work.
    - Vladimir_Nesov 15 Nov 2024 23:47 UTC
      10 points
      0
      Use of repeated data was first demonstrated in the 2022 Galactica paper (Figure 6 and Section 5.1) at 2e23 FLOPs, but without a scaling law analysis that compares with unique data or checks what happens for different numbers of repeats that add up to the same number of tokens-with-repetition. The May 2023 paper does systematic experiments with datapoints of up to 1e22 FLOPs (Figure 4).
      
      So that’s what I called “tiny experiments”. When I say that it wasn’t demonstrated at scale, I mean 1e25+ FLOPs, a scale at which essentially nothing in the research literature has been demonstrated^[1]. Anchoring to this kind of scale (and being properly suspicious of results several orders of magnitude lower) is relevant because we are discussing the fate of 4e27 FLOPs runs.
      
      ^[1] The largest datapoints in measuring the Chinchilla scaling laws for Llama 3 are 1e22 FLOPs. This is then courageously used to choose the optimal model size for the 4e25 FLOPs run that uses 4,000 times more compute than the largest of the experiments.
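To put a number on “several orders of magnitude”, here is a small sketch that measures the gap between the largest scaling-law datapoint and the runs under discussion. The three compute figures are taken from the comment; the function and constant names are made up for illustration.

```python
import math

LARGEST_SCALING_LAW_RUN = 1e22  # largest Llama 3 isoFLOP datapoint, FLOPs
LLAMA3_RUN = 4e25               # Llama 3 training run, FLOPs
DISCUSSED_RUN = 4e27            # the runs whose fate is being discussed, FLOPs


def extrapolation_gap(target_flops, largest_experiment_flops):
    """Return (compute ratio, orders of magnitude) between target and experiment."""
    ratio = target_flops / largest_experiment_flops
    return ratio, math.log10(ratio)


for name, target in (("Llama 3", LLAMA3_RUN), ("4e27 run", DISCUSSED_RUN)):
    ratio, oom = extrapolation_gap(target, LARGEST_SCALING_LAW_RUN)
    print(f"{name}: {ratio:,.0f}x more compute, {oom:.1f} orders of magnitude")
```

The Llama 3 extrapolation spans about 3.6 orders of magnitude (the 4,000x in the footnote). Applying results measured at 1e22 FLOPs to a 4e27 FLOPs run would span about 5.6.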