even a pause which completely stops all new training runs beyond current size indefinitely would only ~double timelines at best, and probably less
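For concreteness, here is the kind of back-of-the-envelope that yields a "~2x at best" figure. The growth rates below are illustrative assumptions for the sketch, not numbers from the post: the idea is just that if effective compute grows from both physical compute scaling and algorithmic progress at roughly comparable rates, freezing the former at most doubles the time to reach a fixed effective-compute target.

```python
# Illustrative back-of-the-envelope only; all growth rates are assumptions for
# the sketch, not figures from the post.
target_oom = 4.0            # assumed: 4 more OOMs of "effective compute" needed
compute_oom_per_year = 0.5  # assumed growth of physical training compute (OOMs/year)
algo_oom_per_year = 0.5     # assumed growth from algorithmic efficiency (OOMs/year)

years_no_pause = target_oom / (compute_oom_per_year + algo_oom_per_year)
# A pause freezes compute scaling; only algorithmic progress continues.
years_with_pause = target_oom / algo_oom_per_year

print(f"no pause:   {years_no_pause:.1f} years")    # 4.0
print(f"with pause: {years_with_pause:.1f} years")  # 8.0 -> ~2x, *if* algorithmic progress is unaffected
```

If algorithmic progress itself slows without large training runs, the slowdown would be larger than this sketch suggests, which is the point of contention below.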
I’d emphasize that we currently don’t have a very clear sense of how algorithmic improvement happens, and it is likely mediated to some extent by large experiments, so I think a pause is likely to slow timelines by more than this implies.
I mean, we can go look at the things which people do when coming up with new more-efficient transformer algorithms, or figuring out the Chinchilla scaling laws, or whatever. And that mostly looks like running small experiments, and extrapolating scaling curves on those small experiments where relevant. By the time people test it out on a big run, they generally know very well how it’s going to perform.
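Roughly the kind of workflow I mean, as a toy sketch: fit a Chinchilla-style power law to losses from a sweep of small runs, then extrapolate to a much larger model. The functional form follows the Chinchilla paper; the data here is synthetic and purely illustrative.

```python
# Fit L(N) = E + A * N**(-alpha) to small-run losses and extrapolate.
# Synthetic data; numbers are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(N, E, A, alpha):
    return E + A * N ** (-alpha)

# Pretend these came from a sweep of small training runs (parameter counts, final losses).
rng = np.random.default_rng(0)
N_small = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
L_small = loss_curve(N_small, E=1.7, A=400.0, alpha=0.34) + rng.normal(0, 0.01, N_small.size)

(E_hat, A_hat, alpha_hat), _ = curve_fit(
    loss_curve, N_small, L_small, p0=[1.5, 100.0, 0.3], maxfev=10_000
)

# Extrapolate to a model ~100x larger than anything actually trained in the sweep.
N_big = 1e11
print(f"fitted: E={E_hat:.2f}, A={A_hat:.1f}, alpha={alpha_hat:.2f}")
print(f"predicted loss at N={N_big:.0e}: {loss_curve(N_big, E_hat, A_hat, alpha_hat):.3f}")
```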
The place where I’d see the strongest case for dependence on large compute is prompt engineering. But even there, it seems like the techniques which work on GPT-4 also generally work on GPT-3 or 3.5?
I think this is true to an extent, but a more systematic analysis would be needed to back it up.
For instance, I recall quantization techniques working much better after a certain scale (though I can’t seem to find the reference...). It also seems important to validate that performance-improving techniques still apply at large scale. Finally, note that the frontier of scale is growing very fast, so even if these discoveries were made with relatively modest compute compared to the frontier, that is still a tremendous amount of compute!
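To put a rough number on that last point. Treat the ~4x/year growth rate of frontier training compute as an assumption for this sketch (roughly in line with commonly cited estimates, not a figure from this thread): compute 100x below today's frontier was itself the frontier only a few years ago.

```python
# How many years behind the frontier is compute that is 100x smaller,
# assuming frontier training compute grows ~4x per year? (Assumed rate.)
import math

growth_per_year = 4.0    # assumed multiplicative growth of frontier training compute
gap_from_frontier = 100  # e.g. experiments run at 1% of a frontier training run

years_behind = math.log(gap_from_frontier) / math.log(growth_per_year)
print(f"Compute {gap_from_frontier}x below today's frontier matched the frontier "
      f"of ~{years_behind:.1f} years ago.")  # ~3.3 years
```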
shrug