Before Chinchilla scaling, nobody was solving the relevant optimization problem. Namely, given a perplexity target, adjust all parameters, including model size and geometry, sparsity, and amount of data (sampled from a fixed, exceedingly large dataset), to hit that target with as few FLOPs as possible. Do this for multiple perplexity targets and make a perplexity-FLOP plot of the optimized training runs, so that you can interpolate. Given a different architecture with its own plot, the estimated reduction in FLOPs at each fixed perplexity within some range is then the training efficiency improvement, valid within that range of perplexities. This might not be the most important measurement in practice, but it makes comparisons between very different architectures meaningful at least in some sense.
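To make that comparison concrete, here is a minimal sketch of the measurement, assuming each data point comes from a training run that was already compute-optimal for its perplexity target; the numbers and the log-log form of the fit are made up for illustration.

```python
import numpy as np

def fit_frontier(perplexities, flops):
    """Fit log10(FLOPs) as a function of log10(perplexity) over optimized runs."""
    coeffs = np.polyfit(np.log10(perplexities), np.log10(flops), deg=1)
    return np.poly1d(coeffs)

def efficiency_gain(frontier_old, frontier_new, perplexity):
    """FLOP ratio between the two frontiers at a fixed perplexity."""
    log_p = np.log10(perplexity)
    return 10 ** (frontier_old(log_p) - frontier_new(log_p))

# Hypothetical compute-optimal runs for a baseline and an improved architecture.
baseline = fit_frontier([12.0, 10.0, 8.5, 7.5], [1e20, 5e20, 3e21, 2e22])
improved = fit_frontier([12.0, 10.0, 8.5, 7.5], [2e19, 1e20, 6e20, 4e21])

print(efficiency_gain(baseline, improved, perplexity=9.0))  # ~5x fewer FLOPs
```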
Without MoE, the gains since the GPT-3 recipe seem to be about 6x (see Figure 4 in the Mamba paper). I’m not sure what the MoE gains are on top of that; the scaling laws I’ve seen don’t quite measure the thing I’ve defined above (I’d be grateful for a pointer). Going from FP32 to BF16 to something effectively 6-bit with Microscaling (see section 4.5) without loss of training quality is another potential win (if it’s implemented in future hardware, or if there is a software-only way of getting similar results without overhead).
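For the Microscaling point, here is an illustrative toy of the core idea, simplified to signed-integer element codes rather than the low-bit floating-point elements the actual MX formats use: each small block of values shares one power-of-two scale, so per-element storage can drop to a few bits.

```python
import numpy as np

def mx_quantize(x, block_size=32, element_bits=6):
    """Quantize a 1-D array with one shared power-of-two scale per block."""
    qmax = 2 ** (element_bits - 1) - 1                     # 31 for signed 6-bit elements
    out = np.empty_like(x, dtype=np.float32)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        max_abs = max(np.max(np.abs(block)), 1e-12)
        scale = 2.0 ** np.ceil(np.log2(max_abs / qmax))    # shared block scale
        q = np.clip(np.round(block / scale), -qmax, qmax)  # low-bit element codes
        out[start:start + block_size] = q * scale          # dequantized values
    return out

x = np.random.randn(1024).astype(np.float32)
print(np.max(np.abs(mx_quantize(x) - x)))  # per-element error stays around scale/2
```

The point of sharing the scale is that its cost amortizes across the block: 6-bit elements plus one 8-bit scale per 32 elements average out to roughly 6.25 bits per value.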
But I’m not sure there is much more there, once the correct optimization problem is being solved, the low-hanging fruit is collected, and the dataset remains natural. The historical record of algorithmic progress is misleading, because it wasn’t measuring what happens when you have unlimited data and can vary model size to get as much compression quality as possible out of a given compute budget.
Thanks for the comment! I agree that would be a good way to more systematically measure algorithmic efficiency improvements. You won’t be able to infer the effects of differences in data quality though—or are you suggesting you think those are very limited anyway?
My point is that algorithmic improvements (in the way I defined them) are very limited, even in the long term, and that there hasn’t been a lot of algorithmic improvement in this sense in the past either. The catch is that the details of this definition matter: if you start relaxing them and interpreting “algorithmic improvement” more informally, you become able to see more algorithmic improvement in the past (and potentially in the future).
One way to see this: in the past, data was often limited and carefully prepared, models weren’t scaled beyond what seemed like reasonable compute, and there wasn’t the patience to keep applying compute once improvement seemed to stall during training. So the historical progress is instead progress in the willingness and the ability to run larger experiments, because you managed to prepare more data or to keep your learning setup working beyond the scale that previously seemed reasonable.
This isn’t algorithmic progress at all in the sense I discuss, so what we actually observe about algorithmic progress in that sense is its historical near-absence, suggesting that it won’t be appearing in the future either. You might want to take a look at MosaicML and their Composer library; they’ve done the bulk of the work on collecting the real low-hanging algorithmic-improvement fruit (just be careful about the marketing that looks at the initial fruit-picking spree and presents it as an endless bounty).
Data doesn’t have this problem; that’s how we get 50M-parameter chess or Go models that match good human performance even though they are tiny by LLM standards. Different data sources confer different levels of competence on the models, even as you still need approximately the same scale to capture a given source in a model (due to the lack of significant algorithmic improvement). But nobody knows how to scalably generate better data for LLMs. Microsoft’s phi models make some steps in that direction by training on synthetic exercises, though those exercises are currently generated by stronger models trained on much more data than the amount of exercises being produced. Possibly this kind of thing eventually takes off and becomes self-sufficient, so that a model manages to produce a stronger model. Alternatively, there might be some way to use the disastrous CommonCrawl to generate a comparable amount of data at the level of quality found in the better sources of natural data, without exceeding that level. And then there’s RL, which can be thought of as a way of generating data: model-free RL through hands-on exploration of an environment (which has to be synthetic to get enough data), and model-based RL through dreaming about an environment (which might even be the real world in practice).
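As a toy illustration of that last framing (everything here is a hypothetical stand-in, not an existing API): a policy rolls out in a synthetic environment, and the trajectories that score well simply become more training data.

```python
import random

class ToyEnvironment:
    """A stand-in synthetic environment: guess a hidden digit."""
    def reset(self):
        self.target, self.steps = random.randint(0, 9), 0
        return None                                   # no useful observation in this toy
    def step(self, action):
        self.steps += 1
        done = action == self.target or self.steps >= 10
        return None, done

def random_policy(_observation):
    return random.randint(0, 9)                       # placeholder for a model

def generate_rl_data(policy, env, num_episodes):
    corpus = []
    for _ in range(num_episodes):
        obs, done, trajectory = env.reset(), False, []
        while not done:
            action = policy(obs)
            obs, done = env.step(action)
            trajectory.append(action)
        if action == env.target:                      # keep only successful episodes
            corpus.append(trajectory)                 # they become ordinary training data
    return corpus

print(len(generate_rl_data(random_policy, ToyEnvironment(), num_episodes=100)))
```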
But in any case, the theory of improvement here is a better synthetic-data-generation recipe that makes a better source, not a better training recipe for capturing a source in a model.