Footnotes
[1] AlphaStar was 10^8 parameters, ten times smaller than a honeybee brain. I think this puts its capabilities in perspective. Yes, it seemed to be more of a heuristic-executor than a long-term planner, because it could occasionally be tricked into doing stupid things repeatedly. But the same is true for insects.
[18] Quick calculation: Suppose we take Ajeya’s best-guess distribution and modify it by lowering the part to the right of 10^35 and raising the part to the left of 10^35, until the 10^35 mark is the 80-percentile mark instead of the 50-percentile mark. And suppose we do this raising and lowering in a “distribution-preserving way,” i.e. the shape of the curve before the 10^35 mark looks exactly the same, it’s just systematically bigger. In other words, we redistribute 30 percentage points of probability mass from above 10^35 to below, in proportion to how the below-10^35 mass is already distributed.
In this case, 60% of the redistributed mass should end up before the old 30% mark. (This is because the mass below the old 30% mark is 60% of the mass below the old median, the 10^35 mark.) And 60% of 30 percentage points is 18, so that means +18 points added before the old 30% mark. This makes it the new 48% mark. So the new 48% mark should be right where the old 30% mark is, which is (eyeballing the spreadsheet) a bit after 2040, about 10 years sooner than Ajeya’s best-guess median, which is a bit after 2050. This is, I think, a rather conservative estimate of how cruxy this disagreement is, for two reasons.
First, my answer to Question Two is 0.9, not 0.8, and that’s after trying to be humble and whatnot. Second, this procedure of redistributing probability mass in proportion to how it is already distributed produces an obviously silly outcome, where there is a sharp drop-off in probability at the 10^35 mark. Realistically, if you are convinced that the answer to Question Two is 0.8 (or whatever) then you should think the probability distribution tapers off smoothly, being already somewhat low by the time the 80th-percentile mark at 10^35 is reached.
Thus, realistically, someone who mostly agrees with Ajeya but answers Question Two with “0.8” should have somewhat shorter timelines than the median-2040-ish I calculated.
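The arithmetic in this footnote can be sketched in a few lines of Python. This is only an illustration of the proportional-redistribution step described above; the helper name `new_percentile` and its parameters are hypothetical and do not come from Ajeya’s spreadsheet.

```python
def new_percentile(old_pct, old_cutoff_pct=50.0, new_cutoff_pct=80.0):
    """Percentile of a point below the cutoff after the mass below the
    cutoff is scaled up proportionally (hypothetical helper)."""
    if not 0.0 <= old_pct <= old_cutoff_pct:
        raise ValueError("old_pct must lie at or below the old cutoff")
    moved = new_cutoff_pct - old_cutoff_pct   # 30 points moved below 10^35
    share = old_pct / old_cutoff_pct          # 30 / 50 = 0.6 of the lower mass
    return old_pct + share * moved            # 30 + 0.6 * 30


# The old 30% mark becomes (approximately, in floating point) the new 48% mark.
print(round(new_percentile(30.0), 6))
```

Under the same assumption, the old median itself maps to the new 80% mark, which is just the premise restated: `new_percentile(50.0)` returns 80.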
[2] This is definitely true for Transformers (and LSTMs I think?), but it may not be true for whatever architecture AlphaStar uses. In particular some people I talked to worry that the vanishing gradients problem might make bigger RL models like OmegaStar actually worse. However, everyone I talked to agreed with the “probably”-qualified version of this claim. I’m very interested to learn more about this.
[3] To avoid catastrophic forgetting, let’s train OmegaStar on all these different games simultaneously, e.g. it plays game A for a short period, then plays game B, then C, etc. and loops back to game A only much later.
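A minimal sketch of the interleaving described in this footnote, assuming a plain round-robin over the games. The function name `interleaved_schedule` and the game labels are placeholders for illustration, not a description of any real training setup.

```python
from itertools import cycle, islice


def interleaved_schedule(games, num_slots):
    """Return the game played in each short training slot, round-robin."""
    return list(islice(cycle(games), num_slots))


# With three games, game A comes back only after B and C have had a turn.
schedule = interleaved_schedule(["A", "B", "C"], 7)
print(schedule)  # ['A', 'B', 'C', 'A', 'B', 'C', 'A']
```

With many games in the list, the gap before any one game repeats grows accordingly, which is the “loops back to game A only much later” behaviour.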
[4] Lukas Finnveden points out that Gwern’s extrapolation is pretty weird. Quoting Lukas: “Gwern takes GPT-3’s current performance on lambada; assumes that the loss will fall as fast as it does on “predict-the-next-word” (despite the fact that the lambada loss is currently falling much faster!) and extrapolates current performance (without adjusting for the expected change in scaling law after the crossover point) until the point where the AI is as good as humans (and btw we don’t have a source for the stated human performance).
I’d endorse a summary more like “If progress carries on as it has so far, we might just need ~1e27 FLOP to get to mturk-level of errors on the benchmarks closest to GPT-3’s native predict-the-next-word game. Even if progress on these benchmarks slowed down and improved at the same rate as GPT-3’s generic word-prediction abilities, we’d expect it to happen at ~1e30 FLOP for the lambada benchmark.””
All that being said, Lukas’ own extrapolation seems to confirm the general impression that GPT’s performance will reach human-level around the same time its size reaches brain-size: “Given that Cotra’s model’s median number of parameters is close to my best guess of where near-optimal performance is achieved, the extrapolations do not contradict the model’s estimates, and constitute some evidence for the median being roughly right.”
[5] One might worry that the original paper had a biased sample of tasks. I do in fact worry about this. However, this paper tests GPT-3 on a sample of actual standardized tests used for admission to colleges, grad schools, etc. and GPT-3 exhibits similar performance (around 50% correct), and also shows radical improvement over smaller versions of GPT.
[6] In theory (and maybe in practice too, given how well the new pre-training paradigm is working? See also e.g. this paper) it should be easier for the model to generalize and understand concepts since it sees images and videos and hears sounds to go along with the text.
[7] GPT-3 has already been used to write its own prompts, sorta. See this paper and look for “metaprompt.” Also, this paper demonstrates the use of stochastic gradient descent on prompts to evolve them into better versions.
[8] Thanks to Connor Leahy for finding this source for me.
[9] 50,000 × 50,000 × 100,000,000 × 10^17 × 6 = 1.5 × 10^35
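As a quick check of the product in this footnote, here is a small Python sketch. It only multiplies the five factors exactly as written; the footnote does not label them here, so the code does not either.

```python
import math

# The five factors from the footnote, in the order given.
factors = [50_000, 50_000, 100_000_000, 10**17, 6]

total = math.prod(factors)   # exact integer arithmetic
print(f"{total:.1e}")        # 1.5e+35
```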
[10] Bostrom and Shulman have an earlier estimate with wide error bars: 10^38 to 10^51 FLOP. See page 6 and multiply their FLOP/s range by the number of seconds in a year.
[11] Well, we’d definitely start with small brains and scale up, but we’d make sure to spend only a fraction of our overall compute on small brains. From the report, page 25 of part 3: “the number of FLOP/s contributed by humans is (~7e9 humans) * (~1e15 FLOP/s / person) = ~7e24. The human population is vastly larger now than it was during most of our evolutionary history, whereas it is likely that the population of animals with tiny nervous systems has stayed similar. This suggests to me that the average ancestor across our entire evolutionary history was likely tiny and performed very few FLOP/s. I will assume that the “average ancestor” performed about as many FLOP/s as a nematode and the “average population size” was ~1e21 individuals alive at a given point in time. This implies that our ancestors were collectively performing ~1e25 FLOP every second on average over the ~1 billion years of evolutionary history.”
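The figures in the quoted passage can be multiplied out to see where a total on the order of 10^41 FLOP comes from. This is a rough sketch using only the numbers quoted above; the seconds-per-year constant (a 365.25-day year) is my own assumption.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed year length, about 3.16e7 s

human_flops = 7e9 * 1e15                # quoted: ~7e9 humans at ~1e15 FLOP/s each
ancestor_flops = 1e25                   # quoted: collective average FLOP/s
years = 1e9                             # quoted: ~1 billion years of evolution

total_flop = ancestor_flops * years * SECONDS_PER_YEAR
print(f"{human_flops:.0e}")               # 7e+24
print(round(math.log10(total_flop), 1))   # 41.5
```

So the quoted assumptions imply roughly 10^41 to 10^42 FLOP in total, consistent with the 10^41 figure used for the evolution-recapitulation version of the question.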
[12] See page 4 of this paper. Relevant quote: “Originally the ST5 mission managers had hired a contractor to design and produce an antenna for this mission. Using conventional design practices the contractor produced a quadrifilar helix antenna (QHA). In Fig. 3 we show performance comparisons of our evolved antennas with the conventionally designed QHA on an ST5 mock-up. Since two antennas are used on each spacecraft – one on the top and one on the bottom – it is important to measure the overall gain pattern with two antennas mounted on the spacecraft. With two QHAs 38% efficiency was achieved, using a QHA with an evolved antenna resulted in 80% efficiency, and using two evolved antennas resulted in 93% efficiency.
Since the evolved antenna does not require a phasing circuit, less design and fabrication work is required, and having fewer parts may result in greater reliability. In terms of overall work, the evolved antenna required approximately three person-months to design and fabricate whereas the conventional antenna required approximately five months. Lastly, the evolved antenna has more uniform coverage in that it has a uniform pattern with only small ripples in the elevations of greatest interest (40° − 80°). This allows for reliable performance as the elevation angle relative to the ground changes.”
[13] So, no Ems, basically. Probably.
[14] I mean, definitely not completely random. But I said we’d fill in the details in a random-but-biologically-plausible way. And children simply have far too many neurons for genes to say much about how they connect to each other. Whatever unknowns there are about how the neurons connect, we can make that part of what’s being optimized by our hundred-thousand-generation search process. The size of the search space can’t be that big, because there isn’t that much space in the human genome to encode any super-complicated instructions. I guess at this point we should start talking about Ajeya’s Genome Anchor idea. I admit I’m out of my depth here.
[15] Since later I talk about how I disagree with Ajeya, I want to make super clear that I really do think her report is excellent. It’s currently the best writing on timelines that I know of. When people ask me to explain my timelines, I say “It’s like Ajeya’s, except…”
[16] I think this because I’ve looked at the probability distribution she gives on page 34 of part 3 of her report and 35 OOMs of floating point operations seems to be the median. I had to do this by measuring with a ruler and doing some eyeballing, so it’s probably not exactly correct, but I’d be surprised if the 35-OOM mark is really above the 55th percentile or below the 45th. As a sanity check, Ajeya has 7 buckets in which her credence is distributed, with 30% in a bucket with median 34.5 OOMs, and 35% in buckets with higher medians, and 35% in buckets with lower medians. (But the lower buckets are closer to 35 than the higher buckets, meaning their tails will be higher than 35 more than the higher buckets’ tails will be lower than 35.)
[17] Example: One nitpick I have is that Ajeya projects that the price of compute for AI will fall more slowly than price-performance Moore’s Law, because said law has faltered recently. I instead think we should probably model this uncertainty, with (say) a 50% chance of Moore continuing and a 50% chance of continued slowdown. But even if it were a 100% chance of Moore continuing, this would only bring forward Ajeya’s median timeline to 2045-ish! (At least, according to my tinkering with her spreadsheet.)
[19] I recommend trying out different numbers depending on who is in the conversation. For conversations in which everyone assigns high credence to the 10^35 version, it may be more fruitful to debate the 10^29 version, since 10^29 FLOP is when GPT-7 surpasses human level at text prediction (and is also superhuman at the other tests we’ve tried) according to the scaling laws and performance trends, I think.
For conversations where everyone has low credence in the 10^35 version, I suggest using the 10^41 version, since 10^41 FLOP is enough to recapitulate evolution without any shortcuts.