My point is that algorithmic improvements (in the way I defined them) are very limited, even in the long term, and that there hasn’t been a lot of algorithmic improvement in this sense in the past either. The issue is that the details of this definition matter: if you relax them and start interpreting “algorithmic improvement” more informally, you start to see more algorithmic improvement in the past (and potentially in the future).
One way to see this is that in the past, data was often limited and carefully prepared, models didn’t keep scaling up to all the compute one could reasonably apply, and there wasn’t the patience to keep applying compute once improvement seemed to stall during training. So the historical progress is instead progress in the willingness and the ability to run larger experiments, because you managed to prepare more data or to make your learning setup keep working beyond the scale that previously seemed reasonable.
None of this is algorithmic progress in the sense I discuss, so what we actually observe about algorithmic progress in that sense is its historical near-absence, which suggests it won’t be appearing in the future either. You might want to take a look at Mosaic and their Composer library: they’ve done the bulk of the work on collecting the real low-hanging algorithmic improvement fruit (just be careful about the marketing that looks at the initial fruit-picking spree and presents it as endless bounty).
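For concreteness, here is roughly what applying that basket of Composer speedup methods looks like; the names are from my recollection of the library, and the particular model, dataset, and defaults here are placeholder assumptions rather than anything from their docs:

```python
# Rough sketch of stacking Composer's speedup algorithms onto ordinary training.
# The torchvision model and CIFAR-10 loader are stand-ins; the point is that the
# "low-hanging fruit" is a short, fixed list of composable tricks.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

from composer import Trainer
from composer.models import ComposerClassifier
from composer.algorithms import BlurPool, ChannelsLast, LabelSmoothing, ProgressiveResizing

train_loader = DataLoader(
    datasets.CIFAR10("/tmp/cifar10", train=True, download=True,
                     transform=transforms.ToTensor()),
    batch_size=256, shuffle=True,
)

trainer = Trainer(
    model=ComposerClassifier(models.resnet50(num_classes=10), num_classes=10),
    train_dataloader=train_loader,
    max_duration="10ep",
    # The whole fruit basket fits in one list.
    algorithms=[BlurPool(), ChannelsLast(), LabelSmoothing(smoothing=0.1), ProgressiveResizing()],
)
trainer.fit()
```

The striking thing is how short that `algorithms` list stays even after years of fruit-picking, which is the point about the bounty not being endless.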
Data doesn’t have this problem: that’s how we get 50M parameter chess or Go models that match strong human performance even though they are tiny by LLM standards. Different data sources confer different levels of competence on the models, even though you still need approximately the same scale to capture a given source in a model (given the lack of significant algorithmic improvement). But nobody knows how to scalably generate better data for LLMs. Microsoft’s phi takes some steps in that direction with synthetic exercises, but those exercises are currently generated by stronger models trained on much more data than the exercises themselves amount to. This kind of thing might eventually take off and become self-sufficient, so that a model manages to produce a stronger model. Alternatively, there might be some way to take the disastrous CommonCrawl and turn it into a comparable amount of data at the level of quality found in the better natural sources, without ever exceeding that level. And then there’s RL, which can be thought of as a way of generating data (model-free RL generates it through hands-on exploration of an environment, which should be synthetic to get enough data; model-based RL generates it by dreaming about the environment, which in practice might even be the real world).
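To make the phi-style direction concrete, here is a minimal sketch of the loop I have in mind: a stronger teacher writes synthetic exercises, and a smaller student is trained to capture that source. Every name here (`write_question`, `write_solution`, `train_on`) is a hypothetical stand-in, not any real API; the open question is whether the loop can ever continue with the student as the new teacher and still gain competence:

```python
# Minimal sketch of phi-style synthetic data generation, with hypothetical
# teacher/student interfaces standing in for real models.
from dataclasses import dataclass

@dataclass
class Exercise:
    prompt: str      # a textbook-style question
    solution: str    # worked answer written by the teacher

def generate_exercises(teacher, topics, per_topic=1000):
    """The teacher (trained on far more data than it emits here) writes exercises."""
    exercises = []
    for topic in topics:
        for _ in range(per_topic):
            prompt = teacher.write_question(topic)     # hypothetical call
            solution = teacher.write_solution(prompt)  # hypothetical call
            exercises.append(Exercise(prompt, solution))
    return exercises

def distill(teacher, student, topics, rounds=1):
    """Each round: the teacher generates a source, the student captures it.
    The improvement lives in the data recipe, not in the training recipe."""
    for _ in range(rounds):
        data = generate_exercises(teacher, topics)
        student.train_on(data)                         # hypothetical call
        # Self-sufficiency would mean being able to set teacher = student here
        # and still get a stronger model out of the next round.
    return student
```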
But in any case, the theory of improvement here is a better synthetic data generation recipe that yields a better source, not a better training recipe for capturing a source into a model.