Why I think scaling laws will continue to drive progress
Epistemic status: This is a thought I have had for a while. I have never discussed it with anyone in detail; a brief conversation could convince me otherwise.
According to recent reports, there seem to be some barriers to continued scaling. We don’t know exactly what is going on, but it seems like scaling up base models doesn’t bring as much new capability as people had hoped.
However, I think the labs are probably still scaling the wrong thing in some way: the model learns to predict a static dataset from the internet, but what it needs to do later is interact with users and the world. To perform well at such a task, the model needs to understand the consequences of its actions, which means modeling interventional distributions P(X | do(A)) rather than the observational distributions P(X | Y) found in static data. This is related to causal confusion as an argument against the scaling hypothesis.
This viewpoint suggests that if big labs figure out how to predict observations in an online way, through ongoing interaction of the models with users and the world, then this should drive further progress. It’s possible that labs are already doing this, but I’m not aware of it, so my guess is that they haven’t yet fully figured out how to do it.
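To make the gap between P(X | A) and P(X | do(A)) concrete, here is a minimal sketch of a toy causal model with a hidden confounder U that influences both the logged action A and the outcome X. The structure, the probabilities, and all function names are assumptions chosen purely for illustration; nothing here comes from a real dataset or from any lab's setup.

```python
# Toy structural causal model (all numbers are illustrative assumptions):
#   U -> A (only in the logged data), U -> X, A -> X
P_U1 = 0.5                        # P(U = 1), the hidden confounder
P_A1_GIVEN_U = {0: 0.1, 1: 0.9}   # logging policy: P(A = 1 | U = u)


def p_x1(a, u):
    """P(X = 1 | A = a, U = u): small effect of A, large effect of U."""
    return 0.2 + 0.1 * a + 0.6 * u


def p_u(u):
    """Marginal P(U = u)."""
    return P_U1 if u == 1 else 1.0 - P_U1


def p_a_given_u(a, u):
    """P(A = a | U = u) under the logging policy."""
    p1 = P_A1_GIVEN_U[u]
    return p1 if a == 1 else 1.0 - p1


def observational(a):
    """P(X = 1 | A = a): what a predictor of static logs would learn."""
    joint = {u: p_u(u) * p_a_given_u(a, u) for u in (0, 1)}
    norm = sum(joint.values())
    return sum(p_x1(a, u) * joint[u] / norm for u in (0, 1))


def interventional(a):
    """P(X = 1 | do(A = a)): A is set by the agent, so U keeps its marginal."""
    return sum(p_x1(a, u) * p_u(u) for u in (0, 1))


if __name__ == "__main__":
    for a in (0, 1):
        print(f"A={a}: observational={observational(a):.2f}, "
              f"interventional={interventional(a):.2f}")
```

In this toy setup the logs suggest that choosing A = 1 raises the chance of X = 1 by 0.58, while actually intervening raises it by only 0.10. A model trained purely on the logs would overestimate the effect of its own actions, which is the kind of causal confusion described above.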
Tailcalled talked about this two years ago. A model which predicts text does a form of imitation learning. So it is bounded by the text it imitates, and by the intelligence of humans who have written the text. Models which predict future sensory inputs (called “predictive coding” in neuroscience, or “the dark matter of intelligence” by LeCun) don’t have such a limitation, as they predict reality more directly.
I think this misunderstands what discussion of “barriers to continued scaling” is all about. The question is whether we’ll continue to see ROI comparable to recent years by continuing to do the same things. If not, well… there is always, at all times, the possibility that we will figure out some new and different thing to do which will keep capabilities going. Many people have many hypotheses about what those new and different things could be: your guess about interaction is one, inference time compute is another, synthetic data is a third, deeply integrated multimodality is a fourth, and the list goes on. But these are all hypotheses which may or may not pan out, not already-proven strategies, which makes them a very different topic of discussion than the “barriers to continued scaling” of the things which people have already been doing.
This seems right to me, but the discussion of “scaling will plateau” feels like it usually comes bundled with “and the default expectation is that this means LLM-centric-AI will plateau”, which seems like the wrong-belief-to-have, to me.
The paper seems to be about scaling laws for a static dataset as well?
Similar to the initial study of scale in LLMs, we focus on the effect of scaling on a generative pre-training loss (rather than on downstream agent performance, or reward- or representation-centric objectives), in the infinite data regime, on a fixed offline dataset.
To learn to act you’d need to do reinforcement learning, which is massively less data-efficient than the current self-supervised training.
More generally: I think almost everyone thinks that you’d need to scale the right thing for further progress. The question is just what the right thing is if text is not it, because text encodes highly powerful abstractions (produced by humans and human culture over many centuries) in a very information-dense way.
If you look at the Active Inference community, there’s a lot of work going into probabilistic programming languages (PPLs) to do more efficient world modelling, but that shit ain’t easy and, as you say, it is a lot more compute-heavy.
I think there’ll be a scaling break due to this, but once it is figured out algorithmically we will be back, and back with a vengeance, as I think most safety challenges have a self vs. environment model as a necessary condition for being properly engaged with (which LLM world modelling currently doesn’t engage with).
What triggered me to write this is a new paper on scaling laws for world modeling, which is about exactly what I’m talking about here.