I’m especially keen to hear responses to this point:
Eliezer: Backtesting this viewpoint on the previous history of computer science, it seems to me to assert that it should be possible to:
Train a pre-Transformer RNN/CNN-based model, not using any other techniques invented after 2017, to GPT-2 levels of performance, using only around 2x as much compute as GPT-2;
Play pro-level Go using 8-16 times as much computing power as AlphaGo, but only 2006 levels of technology.
...
Your model apparently suggests that we have gotten around 50 times more efficient at turning computation into intelligence since that time; so, we should be able to replicate any modern feat of deep learning performed in 2021, using techniques from before deep learning and around fifty times as much computing power.
OpenPhil: No, that’s totally not what our viewpoint says when you backfit it to past reality. Our model does a great job of retrodicting past reality.
My guess is that Ajeya / OpenPhil would say: “The halving-in-costs every 2.5 years is an average, not something that holds for every task. Of course there are going to be plenty of things for which algorithmic progress has been much faster, and plenty for which it has been much slower. And we didn’t pull 2.5 out of our ass; we got it from fitting to past data.”
This seems to rebut the specific point EY made, but it also seems to support his more general skepticism about this method. What we care about is algorithmic progress relevant to AGI or APS-AI, and if that could be orders of magnitude faster or slower than halving every 2.5 years...
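For reference, the arithmetic connecting the two figures is just compounding a constant halving time; the 2006–2021 span below is my assumption, picked to match the dialogue above, not a number from the report:

```python
# Back-of-the-envelope: how much cheaper a fixed level of performance gets if
# algorithmic progress halves costs every 2.5 years. The 2006-2021 span is an
# assumption chosen to match the dialogue above, not a figure from the report.
halving_time_years = 2.5
span_years = 2021 - 2006
efficiency_gain = 2 ** (span_years / halving_time_years)
print(f"~{efficiency_gain:.0f}x")  # ~64x, the same order of magnitude as EY's "around 50 times"
```

Whether that comes out as ~50x or ~64x just depends on the exact endpoints and halving time assumed.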
The definition of “year Y compute requirements” is complicated in a kind of crucial way here, to attempt to a) account for the fact that you can’t take any amount of compute and turn it into a solution for some task literally instantly, while b) capture that there still seems to be a meaningful notion of “the compute you need to do some task is decreasing over time.” I go into it in this section of part 1.
First we start with the “year Y technical difficulty of task T”:
In year Y, imagine a largeish team of good researchers (e.g. the size of AlphaGo’s team) is embarking on a dedicated project to solve task T.
They get an amount of $D dumped on them, which could be more $ than exists in the whole world, like 10 quadrillion or whatever.
With a few years of dedicated effort (e.g. 2-5), plus whatever fungible resources they could buy with D dollars (e.g. compute, data, and low-skilled human labor), can that team of researchers produce a program that solves task T? Here we assume that the fungible resources are infinitely available if you pay, so e.g. if you pay a quadrillion dollars you can get an amount of compute that is (FLOP/$ in year Y) * (1 quadrillion), even though we obviously don’t have that many computers.
And the “technical difficulty of task T in year Y” is how big D needs to be for the best (i.e. cheapest) plan that the researchers can come up with in that time. What I wrote in the doc was:
The price of the bundle of resources that it would take to implement the cheapest solution to T that researchers could have readily come up with by year Y, given the CS field’s understanding of algorithms and techniques at that time.
And then you have “year Y compute requirements,” which is whatever amount of compute they’d buy with whatever portion of D dollars they spend on compute.
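For concreteness, here is a minimal code sketch of how I read these two definitions; the `can_solve` oracle, the doubling search over budgets, and the `frac_on_compute` / `flop_per_dollar` helpers are stand-ins introduced for illustration, not anything from the report:

```python
# Illustrative sketch of the two definitions above. `can_solve(task, year, budget)`
# stands in for the (unobservable) judgment: could a good team, given a few years
# of dedicated effort plus this budget's worth of fungible resources at year-Y
# prices, produce a program that solves the task?

def technical_difficulty(task, year, can_solve):
    """Roughly: the smallest budget D (in dollars) for which the cheapest plan the
    team could readily come up with by `year` would solve `task`."""
    D = 1.0
    while not can_solve(task, year, budget=D):
        D *= 2  # crude upward search over budgets, purely for illustration
    return D

def compute_requirement(task, year, can_solve, frac_on_compute, flop_per_dollar):
    """FLOP bought with the portion of that cheapest budget spent on compute."""
    D = technical_difficulty(task, year, can_solve)
    return D * frac_on_compute * flop_per_dollar(year)
```

None of this is computable in practice, of course; the point is only to pin down which quantity “year Y compute requirements” picks out.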
This definition is convoluted, which isn’t ideal, but after thinking about it for ~10 hours it was the best I could do to balance a) and b) above.
With all that said, I actually do think that the team of good researchers could have gotten GPT-level perf with somewhat more compute a couple years ago, and AlphaGo-level perf with significantly more compute several years ago. I’m not sure exactly what the ratio would be, but I don’t think it’s many OOMs.
The thing you said about it being an average with a lot of spread is also true. I think a better version of the model would have probability distributions over the algorithmic progress, hardware progress, and spend parameters; I didn’t put that in because the focus of the report was estimating the 2020 compute requirements distribution. I did try some different values for those parameters in my aggressive and conservative estimates, but in retrospect the spread was not wide enough on those.
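As a sketch of what that might look like, here is a toy Monte Carlo version in which those three parameters are sampled from distributions instead of being fixed; every number and distribution choice below is a placeholder I’m making up for illustration, not a value from the report:

```python
# Toy Monte Carlo over the three model parameters: algorithmic halving time,
# hardware price-performance doubling time, and annual growth in $ spent.
# All numbers and distribution choices are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
years_ahead = 20

halving_time = rng.lognormal(mean=np.log(2.5), sigma=0.4, size=n)  # years per 2x algorithmic gain
hw_doubling  = rng.lognormal(mean=np.log(2.5), sigma=0.3, size=n)  # years per 2x FLOP/$
spend_growth = rng.lognormal(mean=np.log(1.2), sigma=0.1, size=n)  # annual multiplier on $ spent

# Growth in "effective compute" available for the largest training run after `years_ahead` years.
effective_compute_gain = (
    2 ** (years_ahead / halving_time)   # algorithmic progress
    * 2 ** (years_ahead / hw_doubling)  # hardware price-performance
    * spend_growth ** years_ahead       # willingness to spend
)
# OOMs of gain at the 10th / 50th / 90th percentiles -- the spread, not just a point estimate.
print(np.percentile(np.log10(effective_compute_gain), [10, 50, 90]))
```

Feeding spreads like this through the model would then give a distribution over when affordable effective compute crosses the (also uncertain) 2020 requirements, rather than a single crossing year.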
Nice, thanks!