Original GPT-4 is rumored to be a 2e25 FLOPs model. The 20K H100 clusters that have been around for more than a year give 8e25 BF16 FLOPs over 4 months at 40% utilization. Llama 3 405B is 4e25 FLOPs. The 100K H100 clusters that are only starting to come online in the last few months give 4e26 FLOPs when training for 4 months, and the 1 gigawatt, 500K B200 training systems currently being built will give 4e27 FLOPs in 4 months.
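As a sanity check, here is the arithmetic behind those figures. The peak dense BF16 throughputs are my assumptions (roughly 989 TFLOP/s for H100, 2250 TFLOP/s for B200), not numbers from the comment:

```python
# Cluster training compute: GPUs x peak FLOP/s x utilization x seconds.
# Assumed peak dense BF16 throughputs (not stated above):
H100_BF16 = 989e12   # FLOP/s
B200_BF16 = 2250e12  # FLOP/s

def cluster_flops(n_gpus: int, peak_flops: float,
                  utilization: float = 0.4, days: float = 120) -> float:
    """Total FLOPs delivered by a cluster over a training run."""
    return n_gpus * peak_flops * utilization * days * 86400

print(f"{cluster_flops(20_000, H100_BF16):.1e}")   # ~8e25  (20K H100s, 4 months)
print(f"{cluster_flops(100_000, H100_BF16):.1e}")  # ~4e26  (100K H100s)
print(f"{cluster_flops(500_000, B200_BF16):.1e}")  # ~4.7e27, i.e. the quoted 4e27
```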
So the lack of scaling-related improvement in deployed models since GPT-4 is likely the result of only having seen the 2e25-8e25 FLOPs range of scale so far. The rumors about the new models being underwhelming are less concrete, and they concern the very first experiments in the 2e26-4e26 FLOPs range. Only by early 2025 will there be multiple 2e26+ FLOPs models from different developers to play with, the first results of the experiment of scaling considerably past GPT-4.
And in 2026, once the 300K-500K B200 clusters train some models, we’ll be observing the outcomes of scaling to 2e27-6e27 FLOPs. Only by late 2026 will there be a significant chance of reaching a scaling plateau that lasts for years: scaling further would need $100 billion training systems that won’t get built without sufficient success, and AI accelerators improve much more slowly than the current rate of funding-fueled scaling.
I don’t expect that to be particularly relevant. The data wall is still there; scaling just compute has considerably worse returns than the curves we’ve been on for the past few years, and we’re not expecting synthetic data to be anywhere near sufficient to bring us close to the old curves.
Nobody has admitted to trying repeated data at scale yet (so we don’t know that it doesn’t work); the tiny experiments suggest it can 5x the data with little penalty and 15x the data in a still-useful way. It’s not yet relevant for large models, but it might turn out that small models would already benefit greatly.
There are 15-20T tokens in the datasets whose size is disclosed for current models (Llama 3, Qwen 2.5), and plausibly 50T tokens of tolerable quality can be found (pretraining only needs to create useful features, not relevant behaviors). With 5x repetition of 50T tokens, even at 80 tokens/parameter[1] we can make good use of 5e27-7e27 FLOPs[2], which even a 1 gigawatt, 500K B200 system of early 2026 would need 4-6 months to provide.
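The 5e27-7e27 range falls out of the stated assumptions: 250T tokens-with-repetition at 80 tokens/parameter, with training compute estimated as 6ND for a dense transformer or 9.6ND for an MoE model. A sketch of the arithmetic (the 500K B200 system rate is my assumption):

```python
# Data-wall arithmetic: 5 epochs over 50T tokens at 80 tokens/parameter.
D = 5 * 50e12          # tokens-with-repetition
N = D / 80             # parameters at 80 tokens/parameter (~3.1e12)
c_dense = 6 * N * D    # ~4.7e27 FLOPs (dense-transformer estimate)
c_moe = 9.6 * N * D    # ~7.5e27 FLOPs (MoE estimate)

# Time on an assumed 500K B200 system at 40% utilization,
# ~4.5e20 FLOP/s of delivered BF16 compute:
rate = 500_000 * 2250e12 * 0.4
print(c_dense / rate / 86400)  # ~120 days, about 4 months
print(c_moe / rate / 86400)    # ~193 days, about 6 months
```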
The isoFLOP plots (varying tokens per parameter at fixed compute) seem to show loss/perplexity basins that are quite wide once compute reaches about 1e20 FLOPs. The basins also get wider for hybrid attention (compare the 100% Attention isoFLOPs in the “Perplexity scaling analysis” figure to the others). So using a slightly suboptimal tokens/parameter ratio of, say, 40 likely won’t hurt performance much at all, in which case we get to use 9e27-2e28 FLOPs by training a larger model on the same 5x 50T token dataset. The data wall for text data is unlikely to be a 2024-2026 issue.
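The 9e27-2e28 figure is the same back-of-the-envelope calculation with the larger model implied by 40 tokens/parameter, again using the 6ND and 9.6ND compute estimates:

```python
# Same 250T-token dataset, but a larger model at 40 tokens/parameter.
D = 5 * 50e12        # tokens-with-repetition
N = D / 40           # ~6.25e12 parameters
print(f"{6 * N * D:.1e}")    # ~9.4e27 FLOPs (dense, 6ND)
print(f"{9.6 * N * D:.1e}")  # ~1.5e28 FLOPs (MoE estimate, 9.6ND)
```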
This conservatively asks for much more data than Chinchilla’s 20 tokens per parameter, in light of the range of results in more recent experiments, and adds some penalty for repetition of data. For example, Llama 3 had 40 tokens per parameter estimated as optimal for 4e25 FLOPs from isoFLOPs of smaller runs (up to 1e22 FLOPs, Figure 2), and linear extrapolation in log-coordinates (Figure 3) predicts that this value slowly increases with compute. But other experiments have it decreasing with compute, so this is unclear.
Use of repeated data was first demonstrated in the 2022 Galactica paper (Figure 6 and Section 5.1), at 2e23 FLOPs, but without a scaling law analysis that compares with unique data or checks what happens for different numbers of repeats that add up to the same number of tokens-with-repetition. The May 2023 paper does systematic experiments with datapoints of up to 1e22 FLOPs (Figure 4).
So that’s what I called “tiny experiments”. When I say that it wasn’t demonstrated at scale, I mean 1e25+ FLOPs, which is true for essentially all research literature[1]. Anchoring to this kind of scale (and being properly suspicious of results several orders of magnitude lower) is relevant because we are discussing the fate of 4e27 FLOPs runs.
The largest datapoints in measuring the Chinchilla scaling laws for Llama 3 are 1e22 FLOPs. This is then courageously used to choose the optimal model size for the 4e25 FLOPs run that uses 4,000 times more compute than the largest of the experiments.
The usual estimate for training compute of a dense transformer is 6ND, but a recent Tencent paper estimates 9.6ND for their MoE model (Section 2.3.1).
FYI, my update from this comment was:
Hmm, seems like a decent argument...
… except he said “we don’t know that it doesn’t work”, which is an extremely strong update that it will clearly not work.