I think this question is interesting but difficult to answer based on the data we have, because our dataset contains so few of the unusual examples that would really allow us to answer it with confidence. Our model assumes that compute and algorithms are substitutes, but that assumption isn't based on anything we infer from the data.
Our model is certainly not exactly correct, in the sense that there should be some complementarity between compute and algorithms, but the complementarity probably only becomes noticeable for extreme ratios between the two contributions. One way to think about this is that we can approximate a CES production function
$$Y(C, A) = \left(\alpha C^{\rho} + (1 - \alpha) A^{\rho}\right)^{1/\rho}$$
in training compute C and algorithmic efficiency A when C/A≈1 by writing it as
$$Y(C, A) = A \, Y\!\left(e^{\log(C/A)}, 1\right) = A \left(\alpha e^{\rho \log(C/A)} + (1 - \alpha)\right)^{1/\rho} \approx A \left(1 + \alpha \rho \log(C/A)\right)^{1/\rho} \approx C^{\alpha} A^{1 - \alpha}$$
which means the first-order behavior of the function around C/A ≈ 1 doesn't depend on ρ, the parameter that controls complementarity versus substitutability. Since people empirically seem to train models in the regime where C/A is close to 1, it's difficult to identify ρ from the data we have, and approximating by a Cobb-Douglas (which is what we do) does about as well as anything else. For this reason, I would caution against using our model to predict the performance of models that have an unusual combination of dataset size, training compute, and algorithmic efficiency.
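As a quick numerical illustration (a sketch in Python; the α and ρ values are made up), CES functions with very different ρ are nearly indistinguishable from the Cobb-Douglas approximation when C/A is close to 1, but diverge sharply for extreme ratios:

```python
import numpy as np

def ces(C, A, alpha=0.5, rho=-2.0):
    """CES production function Y(C, A) = (alpha*C^rho + (1-alpha)*A^rho)^(1/rho)."""
    return (alpha * C**rho + (1 - alpha) * A**rho) ** (1 / rho)

def cobb_douglas(C, A, alpha=0.5):
    """First-order (Cobb-Douglas) approximation of the CES around C/A = 1."""
    return C**alpha * A**(1 - alpha)

A = 1.0
for ratio in [1.1, 2.0, 100.0]:    # C/A close to 1 vs. extreme
    for rho in [-2.0, -0.5, 0.5]:  # strong complementarity ... more substitutable
        y_ces = ces(ratio * A, A, rho=rho)
        y_cd = cobb_douglas(ratio * A, A)
        print(f"C/A={ratio:6.1f}  rho={rho:+.1f}  "
              f"CES={y_ces:7.3f}  Cobb-Douglas={y_cd:7.3f}")
```

At C/A = 1.1 the three ρ values give outputs within about a percent of the Cobb-Douglas value; at C/A = 100 they differ by an order of magnitude or more, which is exactly the regime our data doesn't cover.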
In general, a more diverse dataset containing models trained with unusual values of compute and data for the year in which they were trained would help our analysis substantially. There are some problems with doing this experiment ourselves: for instance, techniques used to train larger models often perform worse than older methods if we try to scale them down. There isn't much incentive to make algorithms run really well on small compute and data budgets, and that's going to bias us towards thinking we're more bottlenecked by compute and data than we actually are.
Interesting, thanks! To check my understanding:
- In general, as time passes, all researchers increase their compute usage at a similar rate. This makes it hard to distinguish improvements caused by compute scaling from improvements caused by algorithmic progress.
- If the correlation between year and compute were perfect, we wouldn't be able to do this at all.
- But there is some variance in how much compute different papers use in a given year. This variance is large enough that we can estimate the first-order effects of algorithmic progress and compute usage.
- Complementarity, however, is a second-order effect, and the data doesn't contain enough variation or enough data points to give a good estimate of second-order effects.
This looks correct to me—this is indeed how the model is able to disentangle algorithmic progress from scaling of training compute budgets.
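Here is a minimal simulation sketch of that identification argument (Python with numpy; the trend, effect sizes, and noise level are all invented for illustration, and this is not our actual estimation code): when papers' compute budgets scatter around the yearly trend, the year and log-compute coefficients are estimated precisely, and as the scatter shrinks toward perfect collinearity their standard errors blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # hypothetical number of models in the dataset

def fit(scatter):
    """Simulate a decade of papers and regress performance on year and log-compute.

    Compute grows in lockstep with year; `scatter` controls the paper-to-paper
    variation in log-compute around that trend.
    """
    year = rng.uniform(0, 10, n)                    # years since start of period
    log_compute = year + rng.normal(0, scatter, n)  # compute trend plus scatter
    perf = 0.3 * year + 0.5 * log_compute + rng.normal(0, 0.3, n)

    X = np.column_stack([np.ones(n), year, log_compute])
    coef, *_ = np.linalg.lstsq(X, perf, rcond=None)
    resid = perf - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(X.T @ X)))
    return coef, se

for scatter in [0.5, 0.05]:
    coef, se = fit(scatter)
    print(f"scatter={scatter:.2f}  year: {coef[1]:+.2f} +/- {se[1]:.2f}  "
          f"log-compute: {coef[2]:+.2f} +/- {se[2]:.2f}")
```

With scatter of 0.5 the two first-order coefficients come back well separated from zero; with scatter of 0.05 the standard errors are roughly ten times larger, even though the noise in performance is unchanged.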
The problems you mention are even more extreme with dataset size because plenty of the models in our analysis were only trained on ImageNet-1k, which has around 1M images. So more than half of the models in our dataset actually just use the exact same training set, which makes our model highly uncertain about the dataset side of things.
In addition, the way people typically incorporate extra data is by pretraining on bigger, more diverse datasets and then fine-tuning on ImageNet-1k. This is obviously different from sampling more images from the training distribution of ImageNet-1k, though bigger datasets such as ImageNet-21k are constructed on purpose to be similar in distribution to ImageNet-1k. We actually tried to take this into account explicitly by introducing some kind of transfer exponent between different datasets, but this didn’t really work better than our existing model.
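To give a sense of what I mean by a transfer exponent (this is a sketch of the general idea, not our exact parameterization), one natural way for such a term to enter is to convert pretraining data into "effective" in-distribution data via a power law before it goes into the data term of the scaling law:

$$D_{\text{eff}} = D_{\text{1k}} + k \, D_{\text{pre}}^{\,\tau}, \qquad 0 < \tau \le 1,$$

where $\tau$ captures how well the pretraining distribution (e.g. ImageNet-21k) transfers to ImageNet-1k.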
One final wrinkle is the irreducible loss of ImageNet. I tried to get some handle on this by reading the literature, and I would estimate a lower bound of maybe 1-2% on the irreducible top-1 error, as this seems to be the fraction of images that have incorrect labels. There's a bigger fraction of images that could plausibly fit multiple categories at once, but models seem able to do substantially better than chance on these examples, and it's not clear when we can expect this progress to cap out.
Our model specification assumes that in the infinite-compute, infinite-data limit you reach 100% accuracy. This is probably not exactly right because of irreducible loss, but since models are already above 90% top-1 accuracy, I think it's probably not too big of a problem for within-distribution inference, e.g. answering questions such as "how much software progress did we see over the past decade". Out-of-distribution inference is a totally different game, and I would not trust our model with it for a variety of reasons; the biggest one is really the lack of diversity and the limited size of the dataset, which doesn't have much to do with our choice of model.
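Schematically (this is the shape of the assumption rather than our exact functional form), the specification lets the top-1 error decay to zero as compute and data grow, whereas an irreducible-loss variant would add a floor consistent with the 1-2% label-noise estimate above:

$$\text{err}(C, D) = a\,C^{-\beta} + b\,D^{-\gamma} \quad\text{versus}\quad \text{err}(C, D) = E_{\infty} + a\,C^{-\beta} + b\,D^{-\gamma}, \quad E_{\infty} \approx 0.01\text{--}0.02.$$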
To be honest, I think ImageNet-1k is just a bad benchmark for evaluating computer vision models. The reason we have to use it here is that the benchmarks that correlate better with real-world use cases were all developed recently, and we have no data on how past models perform on them. When we were starting this investigation we had to trade off benchmark quality against the size and diversity of our dataset, and we ended up going with ImageNet-1k top-1 accuracy for that reason. With better data on superior benchmarks, we would not have made this choice.