This post points to a rather large update, which I think has not yet propagated through the collective mind of the alignment community. Gains from algorithmic improvement have been roughly comparable to gains from compute and data, and much larger on harder tasks (which are what matter for takeoff).
Yet there’s still an implicit assumption behind lots of alignment discussion that progress is mainly driven by compute. This is most obvious in discussions of a training pause: such proposals are almost always about stopping very large runs only. That would stop increases in training compute. Yet it’s unlikely to slow down algorithmic progress much; algorithmic progress does use lots of compute (i.e. trying stuff out), but it uses lots of compute in many small runs, not big runs. So: even a pause which completely stops all new training runs beyond current size indefinitely would only ~double timelines at best, and probably less (since algorithmic progress is a much larger share than compute and data for harder tasks). Realistically, a pause would buy years at best, not decades.
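For concreteness, here is a back-of-envelope sketch of the "~double timelines at best" arithmetic. All numbers below (the capability gap, the 4x/year growth rates) are made-up placeholders, not figures from the linked analysis; the point is only that freezing one of two comparable multiplicative growth factors at most doubles the time to any fixed effective-compute target, and less than doubles it if the algorithmic factor dominates.

```python
import math

# Toy model: effective compute = physical training compute x algorithmic
# efficiency, each compounding yearly. All numbers are illustrative
# placeholders, not estimates from the linked research.
TARGET = 1e4          # assumed effective-compute gap to some capability threshold
COMPUTE_GROWTH = 4.0  # assumed yearly growth in frontier run size
ALGO_GROWTH = 4.0     # assumed yearly growth in algorithmic efficiency

def years_to_target(compute_growth, algo_growth, target=TARGET):
    """Years until effective compute grows by `target`, given yearly
    multiplicative growth in each factor."""
    return math.log(target) / math.log(compute_growth * algo_growth)

no_pause = years_to_target(COMPUTE_GROWTH, ALGO_GROWTH)
paused = years_to_target(1.0, ALGO_GROWTH)  # a pause freezes frontier run size
print(f"equal shares: {paused / no_pause:.2f}x longer")  # -> 2.00x

# If algorithmic progress is the larger share (as claimed for harder tasks),
# the slowdown from freezing compute is less than 2x.
print(f"algo-heavy:   {years_to_target(1.0, 8.0) / years_to_target(4.0, 8.0):.2f}x longer")
```

Note that the ratio does not depend on the assumed size of the gap, since both scenarios compound toward the same target; only the relative shares of compute and algorithmic growth matter.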
I’ve personally cited the linked research within the past year. I think this is a major place where the median person in alignment will look back in a few years and say “Man, we should have collectively updated a lot harder on that”.
That assumption is still there because your interpretation is both not true and not justified by this analysis. As I’ve noted several times before, time-travel comparisons like this are useful for forecasting, but are not causal models of research: they cannot tell you the consequence of halting compute growth, because compute causes algorithmic progress. Algorithmic progress does not drop out of the sky by the tick-tock of a clock; it is the fruit of spending a lot of compute on a lot of experiments, trial-and-error, and serendipity.
Unless you believe that it is possible to create that algorithmic progress in a void of pure abstract thought with no dirty trial-and-error of the sort which actually creates breakthroughs like ResNets or GPTs, then any breakdown like this relying on ‘compute used in a single run’ or ‘compute used in the benchmark instance’ simply represents a lower bound on the total compute spent to achieve that progress.
Once compute stagnates, so too will ‘algorithmic’ progress, because ‘algorithmic’ is just ‘compute’ in a trenchcoat. Only once the compute shows up can the overflowing abundance of ideas be validated, showing which ones were good algorithms after all; otherwise, it’s just Schmidhubering into a void and a Trivial Pursuit game like ‘oh, did you know ResNets and DenseNets were first invented in 1989, and they had shortcut connections well before that? too bad they couldn’t make any use of it then, what a pity’.
It sounds like you did not actually read my comment? I clearly addressed this exact point:
Yet it’s unlikely to slow down algorithmic progress much; algorithmic progress does use lots of compute (i.e. trying stuff out), but it uses lots of compute in many small runs, not big runs.
We are not talking here about a general stagnation of compute; we are talking about some kind of pause on large training runs. Compute will still keep getting cheaper.
If you are trying to argue that algorithmic progress only follows from unprecedentedly large compute runs, then (a) say that, rather than strawmanning me as defending a view in which algorithmic progress is made without experimentation, and (b) that seems clearly false of the actual day-to-day experiments which go into algorithmic progress.
even a pause which completely stops all new training runs beyond current size indefinitely would only ~double timelines at best, and probably less
I’d emphasize that we currently don’t have a very clear sense of how algorithmic improvement happens, and it is likely mediated to some extent by large experiments, so I think a pause is likely to slow timelines by more than this implies.
I mean, we can go look at the things which people do when coming up with new more-efficient transformer algorithms, or figuring out the Chinchilla scaling laws, or whatever. And that mostly looks like running small experiments, and extrapolating scaling curves on those small experiments where relevant. By the time people test it out on a big run, they generally know very well how it’s going to perform.
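To illustrate the kind of small-run extrapolation described above, here is a minimal sketch: fit a Chinchilla-style parametric loss curve L(N, D) = E + A/N^alpha + B/D^beta to small experiments, then predict the loss of a frontier-scale run before launching it. The data here is synthetic, generated from assumed "true" parameters loosely inspired by the published Chinchilla fit, so it only illustrates the workflow, not real numbers.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss curve: L(N, D) = E + A/N^alpha + B/D^beta,
# with N = parameters and D = training tokens. The functional form follows the
# Chinchilla paper; every number below is synthetic, for illustration only.
def loss_curve(x, E, A, alpha, B, beta):
    N, D = x
    return E + A / N**alpha + B / D**beta

# Fake "small run" measurements over a grid of model and data sizes, generated
# from assumed true parameters plus a little measurement noise.
rng = np.random.default_rng(0)
N_grid, D_grid = np.meshgrid([1e7, 3e7, 1e8, 3e8, 1e9],
                             [2e8, 1e9, 5e9, 2.5e10])
N_small, D_small = N_grid.ravel(), D_grid.ravel()
true_params = (1.69, 406.0, 0.34, 411.0, 0.28)
L_small = loss_curve((N_small, D_small), *true_params)
L_small = L_small + rng.normal(0.0, 0.01, size=L_small.shape)

# Fit the curve to the small-scale experiments only.
fit, _ = curve_fit(loss_curve, (N_small, D_small), L_small,
                   p0=(1.5, 300.0, 0.3, 300.0, 0.3), maxfev=50_000)

# Extrapolate to a frontier-sized run (~70B params, ~1.4T tokens)
# before ever launching it.
N_big, D_big = 7e10, 1.4e12
print(f"predicted frontier loss: {loss_curve((N_big, D_big), *fit):.3f}")
print(f"'true' frontier loss:    {loss_curve((N_big, D_big), *true_params):.3f}")
```

The expensive part of this workflow is the grid of small runs, each far below frontier scale; the big run itself is mostly a confirmation of the extrapolated curve.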
The place where I’d see the strongest case for dependence on large compute is prompt engineering. But even there, it seems like the techniques which work on GPT-4 also generally work on GPT-3 or 3.5?
I think this is true to an extent, but a more systematic analysis would be needed to back this up.
For instance, I recall quantization techniques working much better beyond a certain scale (though I can’t seem to find the reference...). It also seems important to validate that techniques to increase performance still apply at large scales. Finally, note that the frontier of scale is growing very fast, so even if these discoveries were made with relatively modest compute compared to the frontier, that is still a tremendous amount of compute!
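To put a rough number on that last point, here is a tiny sketch; the 4x/year frontier growth rate is an assumption for illustration, not a quoted estimate.

```python
import math

# Illustrative only: assumed yearly growth in frontier training compute.
frontier_growth_per_year = 4.0
small_run_fraction = 0.01  # a "small" experiment at 1% of today's frontier

# How long ago was the frontier itself only this large?
years_ago = math.log(1 / small_run_fraction) / math.log(frontier_growth_per_year)
print(f"a 1%-of-frontier run today matches the full frontier of {years_ago:.1f} years ago")
```

Under that assumed growth rate, an experiment at 1% of today's frontier compute matches the entire frontier of only about three years ago.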