Scaling laws are an important phenomenon and are probably deeply tied to the nature of intelligence.
I do take issue with the assertion that scaling laws imply slow takeoff. One key takeaway of the modern ML revolution is that the specific details of architectures-in-the-narrow-sense* are mostly not that important, and that compute and data dominate.
The natural implication is that scaling laws are a function of the data distribution, and mostly not of the architecture. Just because we see a ‘smooth, slow’ scaling law on text data doesn’t mean that this will generalize to other domains, situations, or horizons. In fact, I think we should mostly expect this not to be the case.
*I think the jump from "architectures-in-the-narrow-sense don’t matter" to "architectures-in-the-broad-sense don’t matter" is often made. I think this is obviously not supported by the evidence we have so far (despite many claims to the contrary) and is likely wrong.
Even architectures-in-the-narrow-sense don’t show overarching scaling laws at current scales, right? IIRC the separate curves for MLPs, LSTMs and transformers do not currently match up into one larger curve. See e.g. figure 7 here.
So a sudden capability jump due to a new architecture outperforming transformers the way transformers outperform MLPs at equal compute cost seems to be very much in the cards?
I intuitively agree that current scaling laws seem like they might be related in some way to a deep bound on how much you can do with a given amount of data and compute, since different architectures do show qualitatively similar behavior even if the y-axes don’t match up. But I see nothing to suggest that any current architectures are actually operating anywhere close to that bound.
Is it true that scaling laws are independent of architecture? I don’t know much about scaling laws but that seems surely wrong to me.
E.g., how does RNN scaling compare to transformer scaling?
The relevant laws describe how a target perplexity determines the compute and data needed to reach it, for a training run that tries to use as little compute as possible and is otherwise unconstrained on data. The claim is that this differs surprisingly little across architectures. This is different from what historical trends in algorithmic progress measure, since those results are mostly data-constrained (and the data also needs to come from sufficiently similar distributions to compare architectures), and they fail to get past the initial stretch of questionable scaling at low compute.
It’s still probably mostly a selection effect, but see Mamba’s scaling laws (Figure 4 in the paper), where the FLOPs needed to reach a given perplexity vary by only about 6x across GPT-3, LLaMA, Mamba, Hyena, and RWKV. Also, the curves for different architectures tend not to intersect, which suggests a “compute multiplier” property: how efficient one architecture is relative to another stays roughly constant across a wide range of compute. The question is whether any of these compute multipliers change significantly at greater scale, once you clear the first 1e20 FLOPs or so.
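To make the “compute multiplier” idea concrete, here is a minimal sketch. It assumes each architecture follows a pure power law L(C) = a * C**(-b) with no irreducible-loss term, which is my simplification and not something taken from the Mamba paper. The (a, b) pairs and the function names are hypothetical, chosen only for illustration and not fitted to any real model.

```python
def compute_for_loss(a: float, b: float, target_loss: float) -> float:
    """Invert the assumed power law L(C) = a * C**(-b) to get the compute C."""
    return (a / target_loss) ** (1.0 / b)


def compute_multiplier(arch_ref, arch_new, target_loss: float) -> float:
    """Ratio of compute arch_ref needs to compute arch_new needs for the same loss."""
    return compute_for_loss(*arch_ref, target_loss) / compute_for_loss(*arch_new, target_loss)


# Hypothetical (a, b) coefficients; not measurements of any real architecture.
baseline = (20.0, 0.05)
same_slope = (18.0, 0.05)   # same exponent: parallel line on a log-log plot
steeper = (22.0, 0.055)     # different exponent: lines eventually cross

for loss in (3.0, 2.5, 2.0):
    print(
        f"loss={loss}: "
        f"same-slope multiplier={compute_multiplier(baseline, same_slope, loss):.2f}, "
        f"steeper multiplier={compute_multiplier(baseline, steeper, loss):.2f}"
    )
```

Under these assumptions, equal exponents give a multiplier that is the same at every target loss, which is the "curves don’t intersect" picture. Unequal exponents give a multiplier that drifts as the target loss drops, which is the scenario the open question about larger scales is asking about.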
Hence generating higher-quality data is a plausible way of disrupting the way scaling laws govern slow takeoff. What this data needs to provide is general cognitive competence that therefore applies to the physical world, but that competence doesn’t need to involve initial familiarity with the human world.
So it could be formal proofs on a reasonable distribution of topics, or a superscaled RL system in an environment that sufficiently elicits general reasoning. If the backbone of a dataset shapes representations towards competence, it might transfer to other areas. Thus we get an alien mind that mostly uses natural data as a tool to speak good English and anticipate popular opinion, not as the essential fabric of its own nature.
In the current not-knowing-what-we-are-doing regime, I’m guessing the safer AGIs are scaffolded natural-data LLMs or, failing that, model-based RL systems that develop in contact with the human world or human data. Model-free RL that relies on a synthetic environment to generate enough data risks growing up more alien. The case is less clear for reasoning that originates in synthetic data for math and is grounded in the physical world by having natural data make up a fraction of the datasets for all models in the system (as a kind of multimodality). Such admixing of natural data might even be sufficient to make a model-free RL system less alien.