Is it true that scaling laws are independent of architecture? I don’t know much about scaling laws but that seems surely wrong to me.
E.g., how does RNN scaling compare to transformer scaling?
The relevant laws describe how a target perplexity determines the compute and data needed to reach it with a training run that uses as little compute as possible and is otherwise unconstrained on data. The claim is that this relationship differs surprisingly little across architectures. This is different from what historical trends in algorithmic progress measure, since those results are mostly not unconstrained on data (which also needs to come from sufficiently similar distributions for architectures to be comparable), and they don’t get past the initial stretch of questionable scaling at low compute.
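For concreteness, here’s a minimal sketch of the kind of law this refers to, assuming a Chinchilla-style parametric form L(N, D) = E + A/N^α + B/D^β with C ≈ 6ND (the form and the constants are borrowed from Hoffmann et al. 2022 purely for illustration; nothing in the architecture comparison depends on those exact numbers). It inverts the fit to get the least compute that reaches a target loss when the parameter/token split is free and data is unconstrained:

```python
# A minimal sketch, assuming a Chinchilla-style parametric fit
# L(N, D) = E + A / N**alpha + B / D**beta with training compute C ~ 6 * N * D.
# The constants are roughly the Hoffmann et al. 2022 values, used purely for
# illustration; treat the whole fit as an assumption, not a measured law.
from scipy.optimize import minimize_scalar

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Pretraining loss as a function of parameters N and training tokens D."""
    return E + A / N**alpha + B / D**beta

def best_loss_at_compute(log10_C):
    """Lowest loss reachable with 10**log10_C FLOPs when the N/D split is free."""
    C = 10 ** log10_C
    res = minimize_scalar(
        lambda log10_N: loss(10 ** log10_N, C / (6 * 10 ** log10_N)),
        bounds=(6, 13), method="bounded")
    return res.fun

def compute_for_loss(target):
    """Minimum FLOPs needed to reach `target` loss, by bisection over compute."""
    lo, hi = 18.0, 26.0  # log10 FLOPs bracket
    for _ in range(60):
        mid = (lo + hi) / 2
        if best_loss_at_compute(mid) > target:
            lo = mid
        else:
            hi = mid
    return 10 ** hi

print(f"~{compute_for_loss(2.0):.1e} FLOPs to reach loss 2.0 under this fit")
```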
It’s still probably mostly a selection effect, but see Mamba’s scaling laws (Figure 4 in the paper), where the compute needed to reach a given perplexity varies by only about 6x across GPT-3, LLaMA, Mamba, Hyena, and RWKV. Also, the curves for different architectures tend not to intersect, suggesting a “compute multiplier” property: how efficient one architecture is relative to another stays roughly the same across a wide range of compute. The question is whether any of these compute multipliers change significantly at greater scale, once you clear the first 1e20 FLOPs or so.
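And a toy illustration (with made-up frontier points, not numbers read off Figure 4) of what non-intersecting curves buy you: if architecture B’s compute-vs-perplexity frontier is a constant horizontal shift of architecture A’s on a log-compute axis, then B needs a fixed multiple of A’s FLOPs at every perplexity, and the ~6x figure says the listed architectures all fall within about one such factor of each other:

```python
# A toy picture of the "compute multiplier" reading of those curves. The
# frontier points below are made up (arch B is constructed as an exact 2x
# horizontal shift of arch A on the log-compute axis), not taken from the
# Mamba paper; they just show what non-intersecting curves imply.
import numpy as np

# (log10 FLOPs, log perplexity) along each architecture's compute-optimal frontier
arch_a = np.array([(18.5, 2.20), (19.5, 2.05), (20.5, 1.92), (21.5, 1.81)])
arch_b = np.array([(18.8, 2.20), (19.8, 2.05), (20.8, 1.92), (21.8, 1.81)])

def log10_flops_at(frontier, target_logppl):
    """Interpolate the frontier to get log10 FLOPs needed for a target log-perplexity."""
    logC, logppl = frontier[:, 0], frontier[:, 1]
    # perplexity decreases along the frontier, so reverse for np.interp
    return np.interp(target_logppl, logppl[::-1], logC[::-1])

for target in (2.15, 2.05, 1.95):
    m = 10 ** (log10_flops_at(arch_b, target) - log10_flops_at(arch_a, target))
    print(f"log-ppl {target}: arch B needs ~{m:.1f}x the FLOPs of arch A")
# A roughly constant m across the range is the "compute multiplier"; the ~6x
# spread in Figure 4 says the five architectures all sit within about one
# factor of 6 of each other in this sense.
```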