I sometimes want to point at a concept that I’ve started calling The Scaling Picture. While it’s been discussed at length (e.g., here, here, here), I wanted to take a shot at writing a short version:
The picture:
We see improving AI capabilities as we scale up compute; projecting the last few years of progress in LLMs forward might give us AGI (transformative economic/political/etc. impact similar to the industrial revolution; AI that is roughly human-level or better on almost all intellectual tasks) later this decade. (Note: the picture is about the general trend rather than any specific capability.)
Relevant/important downstream capabilities improve as we scale up pre-training compute (model size and amount of data), although returns are very sublinear for some metrics; this is the current trend. Therefore, you can expect somewhat predictable capability gains over the next few years as we scale up spending (i.e., increase compute) and develop better algorithms and efficiencies.
AI capabilities in the deep learning era are the result of three inputs: data, compute, and algorithms. Keeping algorithms the same and scaling up the others, we get better performance; that’s what scaling means. We can lump progress in data and algorithms together under the banner of “algorithmic progress” (i.e., how much intelligence you can get per unit of compute), and then to some extent we can differentiate the sources of progress: algorithmic progress is primarily driven by human researchers, while compute progress is primarily driven by spending more money to buy/rent GPUs (this may change in the future). In the last few years of AI history, we have seen massive gains in both areas: it’s estimated that the efficiency of algorithms has improved about 3x/year, and the amount of compute used has increased about 4.1x/year (a rough sketch of how these two rates compound appears after the notes below). These are ludicrous speeds relative to most things in the world.
Edit to add: This paper seems like it might explain that breakdown better.
Edit to add: The arguments below are just supposed to be pointers toward longer arguments one could make; the one-sentence version usually isn’t compelling on its own.
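To make the compounding of those two rates concrete, here is a minimal back-of-the-envelope sketch. The 3x/year and 4.1x/year figures are the estimates quoted above; treating them as independent multipliers of “effective compute” is a simplification.

```python
import math

# Back-of-the-envelope: compound the two growth rates quoted above into
# "effective compute" growth. The 3x/year (algorithms) and 4.1x/year (compute)
# figures come from the text; multiplying them is an illustrative simplification.

ALG_GROWTH = 3.0      # algorithmic efficiency gain per year (estimate quoted above)
COMPUTE_GROWTH = 4.1  # training compute growth per year (estimate quoted above)

effective_growth = ALG_GROWTH * COMPUTE_GROWTH   # ~12.3x/year
ooms_per_year = math.log10(effective_growth)     # ~1.09 orders of magnitude per year

print(f"Effective compute growth: ~{effective_growth:.1f}x/year")
print(f"That is ~{ooms_per_year:.2f} OOMs of effective compute per year")
for years in (2, 4, 6):
    print(f"After {years} years: ~{effective_growth ** years:,.0f}x today's effective compute")
```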
Arguments for:
Scaling laws (a mathematically predictable relationship between pretraining compute and perplexity) have held for ~12 orders of magnitude already
We are moving through ‘orders of magnitude of compute’ quickly, so lots of probability mass should land soon (this argument is more involved; it follows from having uncertainty over the orders of magnitude of compute that might be necessary for AGI, like the approach taken here; see here for discussion; a toy version of this calculation is sketched after this list)
Once you get AIs that can speed up AI progress meaningfully, progress on algorithms could go much faster, e.g., by AIs automating the role of researchers at OpenAI. You also get compounding economic returns that allow compute to grow — AIs that can be used to make a bunch of money, and that money can be put into compute. It seems plausible that you can get to that level of AI capabilities in the next few orders of magnitude, e.g., GPT-5 or GPT-6. Automated researchers are crazy.
Moore’s Law has held for a long time. Edit to add: I think a reasonable breakdown for the “compute” category mentioned above is “money spent” and “FLOP purchasable per dollar”. While Moore’s Law is technically about the density of transistors, the thing we likely care more about is FLOP/$, which follows similar trends.
Many people at AGI companies think this picture is right; see, e.g., this, this, this (can’t find an aggregation)
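To unpack the “probability mass over orders of magnitude” bullet above, here is a toy version of the calculation. The prior (uniform over 0–10 additional OOMs of effective compute needed for AGI) is entirely made up for illustration; only the ~1.1 OOMs/year rate comes from compounding the growth figures quoted earlier.

```python
# Toy version of the "uncertainty over OOMs of compute" argument: if you hold a
# wide prior over how many more orders of magnitude of effective compute AGI
# requires, and you cross ~1.1 OOMs per year, a lot of probability mass lands in
# the next few years. The uniform 0-10 OOM prior below is a made-up illustration,
# not a claim about the true requirement.

OOMS_PER_YEAR = 1.1                  # from compounding 3x/year algorithms with 4.1x/year compute
PRIOR_LOW, PRIOR_HIGH = 0.0, 10.0    # hypothetical prior: AGI needs 0-10 more OOMs, uniformly

def prob_agi_by(years: float) -> float:
    """P(AGI within `years`) under the toy uniform-over-OOMs prior."""
    ooms_crossed = OOMS_PER_YEAR * years
    frac = (ooms_crossed - PRIOR_LOW) / (PRIOR_HIGH - PRIOR_LOW)
    return min(max(frac, 0.0), 1.0)

for y in (2, 4, 6, 8):
    print(f"P(AGI within {y} years) ~ {prob_agi_by(y):.0%}")
```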
Arguments against:
Might run out of data. There are estimated to be 100T-1000T internet tokens, and we will likely hit this level within a couple of years.
Might run out of money: we’ve seen ~$100M training runs, we’re likely at $100M-$1B this year, tech R&D budgets are ~$30B, and governments could fund ~$1T (a rough extrapolation of these figures is sketched after this list). One way to avoid this ‘running out of money’ problem is if you get AIs that speed up algorithmic progress sufficiently.
Scaling up is a non-trivial engineering problem, and it might cause slowdowns due to, e.g., GPU failures and the difficulty of parallelizing across thousands of GPUs
Revenue might just not be that big and investors might decide it’s not worth the high costs
OTOH, automating jobs is a big deal if you can get it working
Marginal improvements (maybe) for hugely increased costs; bad ROI.
There are numerous other economic arguments against, mainly arguing that huge investments in AI will not be sustainable; see, e.g., here
Maybe LLMs are missing some crucial thing
Not doing true generalisation to novel tasks in the ARC-AGI benchmark
Not able to learn on the fly — maybe long context windows or other improvements can help
Lack of embodiment might be an issue
This is much faster than many AI researchers are predicting
This runs counter to many methods of forecasting AI development
Will be energy intensive — might see political / social pressures to slow down.
We might see slowdowns due to safety concerns.
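To put rough numbers on the “might run out of money” bullet, here is a sketch that grows a ~$100M training run at the 4.1x/year compute growth rate quoted earlier and checks when it crosses the budget scales mentioned above. Assuming cost tracks compute 1:1 is a simplification (it ignores improving FLOP/$), so treat this as an order-of-magnitude illustration only.

```python
# Rough extrapolation of the "might run out of money" point: grow a ~$100M
# frontier training run at the 4.1x/year compute growth rate quoted earlier and
# see when it crosses the budget scales mentioned above. Assumes cost tracks
# compute 1:1 (ignores improving FLOP/$), so this is only an order-of-magnitude sketch.

GROWTH_PER_YEAR = 4.1
cost = 1e8  # ~$100M training runs, per the bullet above

thresholds = {
    "tech R&D budgets (~$30B)": 30e9,
    "government scale (~$1T)": 1e12,
}

for year in range(0, 9):
    for label, level in list(thresholds.items()):
        if cost >= level:
            print(f"~year {year}: training run at ~${cost:,.0f} crosses {label}")
            del thresholds[label]
    cost *= GROWTH_PER_YEAR
```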
Data is running out for making overtrained models, not Chinchilla-optimal models, because you can repeat data (there’s also a recent hour-long presentation by one of the authors). This systematic study was published only in May 2023, though the Galactica paper from Nov 2022 also has a result to this effect (see Figure 6). The preceding popular wisdom was that you shouldn’t repeat data for language models, so cached thoughts that don’t take this result into account are still plentiful. The result also doesn’t sufficiently rescue highly overtrained models, so the underlying concern still has some merit.
As you repeat data more and more, the Chinchilla ratio of data to parameters (data in tokens divided by the number of active parameters for an optimal use of given compute) gradually increases from 20 to 60 (see the data-constrained efficient frontier curve in Figure 5, which tilts lower on the parameters/data plot, deviating from the Chinchilla efficient frontier line for non-repeated data). You can repeat data essentially without penalty about 4 times, efficiently 16 times, and with any use at all 60 times (at some point, even increasing parameters while keeping data unchanged starts decreasing rather than increasing performance). This allows up to ~100x more compute to be used, compared to Chinchilla-optimal use of non-repeated data, while retaining some efficiency (at 16x repetition), or up to ~1200x more compute for the marginally useful 60x repetition.
The datasets you currently see at the 15-30T token scale are still highly filtered compared to available raw data (see Figure 4). The scale feasible within a few years is about 2e28-1e29 FLOPs (accounting for hypothetical hardware improvement and the larger datacenters of the early 2030s; this is physical, not effective, compute). Chinchilla-optimal compute for a 50T token dataset is about 8e26 FLOPs, which turns into 8e28 FLOPs with 16x repetition of data, and up to 9e29 FLOPs for the barely useful 60x repetition (a back-of-the-envelope check of these numbers is sketched below). Note that sometimes it’s better to perplexity-filter away half of a dataset and repeat it twice than to use the whole original dataset (yellow star in Figure 6; discussion in the presentation), so using highly repeated data on 50T tokens might still outperform less-repeated usage of less-filtered data, which is to say that finding 100T tokens by filtering less doesn’t necessarily work at all. There’s also some double descent for repetition (Appendix D; discussion in the presentation), which suggests that it might be possible to overcome the 60x repetition barrier (Appendix E) with sufficient compute or better algorithms.
In any case, the OOMs of what repeated data allows match the compute that’s plausibly available in the near future (4-8 years). There’s also probably a significant amount of data to be found that’s not on the web, and every 2x increase in unique reasonable-quality data means a 4x increase in usable compute (since Chinchilla-optimal compute scales roughly with the square of the dataset). Where data gets truly scarce soon is for highly overtrained inference-efficient models.
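As a back-of-the-envelope check on those FLOP figures: using the standard C ≈ 6·N·D approximation for training compute and the ~20 tokens-per-parameter Chinchilla ratio reproduces the numbers above. The ~100x and ~1200x repetition multipliers are taken from the comment rather than re-derived.

```python
# Sanity-check of the FLOP figures above, using the standard approximation
# C ~= 6 * N * D for training compute and the ~20 tokens-per-parameter
# Chinchilla ratio. The ~100x (16x repetition) and ~1200x (60x repetition)
# multipliers are taken from the comment above, not re-derived here.

TOKENS = 50e12           # 50T-token dataset
TOKENS_PER_PARAM = 20    # Chinchilla-optimal ratio for non-repeated data

params = TOKENS / TOKENS_PER_PARAM        # ~2.5e12 parameters
chinchilla_flops = 6 * params * TOKENS    # ~7.5e26 FLOPs, i.e. "about 8e26"

print(f"Chinchilla-optimal compute for 50T tokens: ~{chinchilla_flops:.1e} FLOPs")
print(f"With 16x repetition (~100x multiplier):    ~{100 * chinchilla_flops:.1e} FLOPs")
print(f"With 60x repetition (~1200x multiplier):   ~{1200 * chinchilla_flops:.1e} FLOPs")

# Since C ~= 6 * (D / 20) * D = 0.3 * D^2, every 2x of unique data buys roughly
# 4x of usefully spendable compute, as noted above.
```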
I agree that repeated training will change the picture somewhat. One thing I find quite nice about the linked Epoch paper is that the estimated range of tokens spans an order of magnitude. Even though many people have ideas for getting more data (common suggestions I hear include “use private platform data like messaging apps”), most of these don’t change the picture because they don’t move things by more than an order of magnitude, and the scaling trends want more orders of magnitude, not merely 2x.
Repeated data is the type of thing that plausibly adds an order of magnitude or maybe more.
The point is that you need to get quantitative in these estimates to claim that data is running out, since it has to run out compared to available compute, not merely on its own. And the repeated data argument seems by itself sufficient to show that it doesn’t in fact run out in this sense.
Data still seems to be running out for overtrained models, which is a major concern for LLM labs, so from their point of view there is indeed a salient data wall that’s very soon going to become a problem. There are rumors of synthetic data (which often ambiguously gesture at post-training results while discussing the pre-training data wall), but no published research for how something like that improves the situation with pre-training over using repeated data.
it’s estimated that the efficiency of algorithms has improved about 3x/year
There was about a 5x increase since GPT-3 for dense transformers (see Figure 4), and then there’s MoE on top; so, assuming GPT-3 is not much better than a seriously optimized 2017 baseline, it’s more like 30% per year, though plausibly slower recently.
The relevant Epoch paper says the point estimate for compute-efficiency doubling is 8-9 months (Section 3.1, Appendix G), about 2.5x/year (a quick annualization check of these rates is sketched below). Though I can’t make sense of their methodology, which aims to compare the incomparable. In particular, what good is comparing even transformers without following the Chinchilla protocol (finding minima on isoFLOP plots of training runs with individually optimal learning rates, rather than continued pre-training with suboptimal learning rates at many points)? Not to mention non-transformers, where the scaling laws won’t match, so the results of the comparison change as we vary the scale; also, many older algorithms probably won’t scale to arbitrary compute at all.
(With JavaScript mostly disabled, the page you linked lists “Compute-efficiency in language models” as 5.1%/year (!!!). After JavaScript is sufficiently enabled, it starts saying “3 ÷/year”, with a ‘÷’ character, though “90% confidence interval: 2 times to 6 times” disambiguates it. In other places on the same page there are figures like “2.4 x/year” with the more standard ‘x’ character for this meaning.)
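A quick annualization check on the two rates discussed above. The ~7-year span for the 5x figure is my reading of the assumption that GPT-3 is roughly at the level of an optimized 2017 baseline, not a number stated in the paper.

```python
import math

# Convert the two headline figures above into annual growth rates.

# (1) "~5x since GPT-3 for dense transformers", under the assumption that GPT-3
# is about as good as an optimized 2017 baseline, i.e. the 5x effectively spans
# roughly 2017-2024 (~7 years). The span is an assumed reading, not a stated number.
gain, years = 5.0, 7.0
annual = gain ** (1 / years)
print(f"5x over ~{years:.0f} years -> ~{(annual - 1) * 100:.0f}% per year")   # ~26%

# (2) Epoch's point estimate of an 8-9 month doubling time for compute efficiency.
for months in (8, 9):
    per_year = 2 ** (12 / months)
    print(f"doubling every {months} months -> ~{per_year:.1f}x per year")     # ~2.5-2.8x
```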