Historically, compute has grown by many orders of magnitude, while human labor applied to AI and supporting software has grown by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than growth in human labor input.
So if you’re going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards of tech forecasting), it’s best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and to receive more modeling effort and attention, since it’s the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures).
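A minimal sketch of what such an input-based decomposition can look like (a toy Cobb-Douglas production function; the exponents and growth factors below are made up for illustration, not taken from the studies discussed here):

```python
import math

# Toy Cobb-Douglas production function: capability ∝ compute^alpha * labor^beta.
# All numbers are illustrative placeholders.
alpha, beta = 0.6, 0.4      # assumed output elasticities of compute and labor
compute_growth = 1e8        # compute grows by ~8 orders of magnitude
labor_growth = 1e2          # labor grows by ~2 orders of magnitude

# Log-decomposition of progress into the two inputs' contributions.
contrib_compute = alpha * math.log10(compute_growth)  # 4.8 "OOMs" of progress
contrib_labor = beta * math.log10(labor_growth)       # 0.8 "OOMs" of progress
print(contrib_compute, contrib_labor)
```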
I don’t understand the logical leap from “human labor applied to AI didn’t grow much” to “we can ignore human labor”. The amount of labor invested in AI research is related to the time derivative of progress on the algorithms axis. Labor held constant is not the same as algorithms held constant. So, we are still talking about the problem of predicting when AI-capability(algorithms(t), compute(t)) reaches human level. What do you know about the function “AI-capability” that allows you to ignore its dependence on the 1st argument?
Or maybe you’re saying that algorithmic improvements have not been very important in practice? Surely such a claim is not compatible with e.g. the transitions from GOFAI to “shallow” ML to deep ML?
A perfectly correlated time series of compute and labor would not let us say which had the larger marginal contribution, but we have resources to get at that, which I was referring to with ‘plausible decompositions.’ This includes experiments with old and new software and hardware, like the chess ones Paul recently commissioned, and studies by AI Impacts, OpenAI, and Neil Thompson. There are AI scaling experiments, and observations of the results of shocks like the end of Dennard scaling, the availability of GPGPU computing, and Besiroglu’s data on the relative predictive power of compute and labor in individual papers and subfields.
In different ways those tend to put hardware as driving more log improvement than software (with both contributing), particularly if we consider software innovations downstream of hardware changes.
I will have to look at these studies in detail in order to understand, but I’m confused about how this can pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with a not-crazy advantage in compute (or with any amount of compute with the same training data)? Can Eliza-style chatbots compete with GPT-3 however we scale them up?
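My model is something like: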
For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an “effective compute regime” where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns.
In the (relatively small) regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn’t buy you much more than, say, an OOM of compute’s worth of performance (I have no idea if this is literally true; the example is more because it conveys the general gist).
One of the main ways that modern algorithms are better is that they have much larger effective compute regimes. The other main way is enabling more effective conversion of compute to performance.
Therefore, one of the primary impacts of new algorithms is to enable performance to continue scaling with compute the same way it did when you had smaller amounts.
In this model, it makes sense to think of the “contribution” of new algorithms as the factor by which they enable more efficient conversion of compute to performance, and to count the increased performance that comes from new algorithms being able to absorb more compute as primarily hardware progress. I think the studies that Carl cites above are decent evidence that the multiplicative factor of compute → performance conversion you get from new algorithms is smaller than the historical growth in compute, so it further makes sense to claim that most progress came from compute, even though the algorithms were what “unlocked” the compute.
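For an example of something I consider supportive of this model, see the LSTM versus transformer graphs in https://arxiv.org/pdf/2001.08361.pdf

A minimal sketch of the picture above (toy curves with made-up constants; “performance” here is just a stand-in score, not Elo or accuracy):

```python
import numpy as np

def performance(compute, efficiency, regime_cap):
    """Toy compute->performance curve: gains scale with log(compute) up to a
    saturation point (the 'effective compute regime'); above it, extra compute
    gives only steeply diminished returns. All constants are made up."""
    effective = efficiency * np.minimum(compute, regime_cap)
    overflow = np.maximum(compute / regime_cap, 1.0)
    return np.log10(effective) + 0.1 * np.log10(overflow)

old_algo = dict(efficiency=1.0, regime_cap=1e6)    # e.g. an alpha-beta-like method
new_algo = dict(efficiency=10.0, regime_cap=1e12)  # e.g. a deep-RL-like method

for c in [1e4, 1e8, 1e12]:
    # At small compute the two differ by about one OOM's worth of compute;
    # at large compute only the new algorithm keeps absorbing it.
    print(f"{c:.0e}: old={performance(c, **old_algo):.1f}, new={performance(c, **new_algo):.1f}")
```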
Hmm… Interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for given compute relatively quickly after this compute becomes available. Moreover, there is some smooth relation between compute and performance, assuming the best algorithm for this level of compute. [EDIT: The latter part seems really suspicious though, why would this relation persist across very different algorithms?] Or at least this is true if “best algorithm” is interpreted to mean “best algorithm out of some wide class of algorithms s.t. we never or almost never managed to discover any algorithm outside of this class”.
This can justify biological anchors as upper bounds[1]: if biology is operating using the best algorithm then we will match its performance when we reach the same level of compute, whereas if biology is operating using a suboptimal algorithm then we will match its performance earlier. However, how do we define the compute used by biology? Moravec’s estimate is already in the past and there’s still no human-level AI. Then there is the “lifetime” anchor from Cotra’s report which predicts a very short timeline. Finally, there is the “evolution” anchor which predicts a relatively long timeline.
However, in Cotra’s report most of the weight is assigned to the “neural net” anchors which talk about the compute for training an ANN of brain size using modern algorithms (plus there is the “genome” anchor in which the ANN is genome-sized). This is something that I don’t see how to justify using Mark’s model. On Mark’s model, modern algorithms might very well hit diminishing returns soon, in which case we will switch to different algorithms which might have a completely different compute(parameter count) function.
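[1] Assuming evolution also cannot discover algorithms outside our class of discoverable algorithms.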
What Moravec says is merely that $1k human-level compute will become available in the ‘2020s’, and offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don’t hit cheap human-compute until the end of the decade. He also doesn’t commit to how long it takes to turn compute into powerful systems, it’s more of a prerequisite: only once the compute is available can R&D really start, same way that DL didn’t start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.
We already know how much compute we have, so we don’t need Moravec’s projections for this? If Yudkowsky described Moravec’s analysis correctly, then Moravec’s threshold was crossed in 2008. Or, by “other extrapolations” you mean other estimates of human brain compute? Cotra’s analysis is much more recent and IIUC she puts the “lifetime anchor” (a more conservative approach than Moravec’s) at about one order of magnitude above the biggest models currently used.
Now, the seeds take time to sprout, but according to Mark’s model this time is quite short. So, it seems like this line of reasoning produces a timeline significantly shorter than the Plattian 30 years.
As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I’d like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.
I think this is uncharitable and most likely based on a misreading of Moravec. (And generally with gwern on this one.)
As far as I can tell, the source for your attribution of this “prediction” is:
“If this rate of improvement were to continue into the next century, the 10 teraops required for a humanlike computer would be available in a $10 million supercomputer before 2010 and in a $1,000 personal computer by 2030.”
As far as I could tell it sounds from the surrounding text like his “prediction” for transformative impacts from AI was something like “between 2010 and 2030” with broad error bars.
Adding to what Paul said: jacob_cannell points to this comment which claims that in Mind Children Moravec predicted human-level AGI in 2028.
Moravec, “Mind Children”, page 68: “Human equivalence in 40 years”. There he is actually talking about human-level intelligent machines arriving by 2028 - not just the hardware you would theoretically require to build one if you had the ten million dollars to spend on it.
I just went and skimmed Mind Children. He’s predicting human-equivalent computational power on a personal computer in 40 years. He seems to say that humans will within 50 years be surpassed in every important way by machines (page 70, below), but I haven’t found a more precise or short-term statement yet.
The robot who will work alongside us in half a century will have some interesting properties. Its reasoning abilities should be astonishingly better than a human’s—even today’s puny systems are much better in some areas. But its perceptual and motor abilities will probably be comparable to ours. Most interestingly, this artificial person will be highly changeable, both as an individual and from one of its generations to the next. But solitary, toiling robots, however competent, are only part of the story. Today, and for some decades into the future, the most effective computing machines work as tools in human hands. As the machinery grows in flexibility and initiative, this association between humans and machines will be more properly described as a partnership. In time, the relationship will become much more intimate, a symbiosis where the boundary between the “natural” and the “artificial” partner is no longer evident. This collaborative route is interesting for its powerful human consequences even if, as I believe, it will matter little in the long run whether or not humans are an intimate part of the evolving artificial intelligences.
Also, unimportant but cool: Check out his musing about the Fermi Paradox:
A frightening explanation is that the universe is prowled by stealthy wolves that prey on fledgling technological races. The only civilizations that survive long would be ones that avoid detection by staying very quiet. But wouldn’t the wolves be more technically advanced than their prey and if so what could they gain from their raids? Our autonomous-message idea suggests an odd answer. The wolves may be simply helpless bits of data that, in the absence of civilizations, can only lie dormant in multimillion-year trips between galaxies or even inscribed on rocks. Only when a newly evolved, country bumpkin of a technological civilization stumbles and naively acts on one does its eons-old sophistication and ruthlessness, honed over the bodies of countless past victims, become apparent. Then it engineers a reproductive orgy that kills its host and propagates astronomical numbers of copies of itself into the universe, each capable only of waiting patiently for another victim to arise. It is a strategy already familiar to us on a small scale, for it is used by the viruses that plague biological organisms.
While this theory is not nearly as good as the theory I prefer (life is hard, aliens are rare) it strikes me as comparably plausible to the Dark Forest theory. I wonder why I hadn’t heard of it before.
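Those Fermi Paradox musings sound like the plot of A Fire Upon the Deep!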
Actually, the premise of David Brin’s Existence is a close match to Moravec’s paragraph (not a coincidence, I bet, given that David hung around similar circles).
The way that you would think about NN anchors in my model (caveat that this isn’t my whole model):
You have some distribution over 2020-FLOPS-equivalent that TAI needs.
Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
The function from 20XX to the 1:N ratio is relatively predictable, e.g. a “smooth” exponential with respect to time.
Therefore, even though current algorithms will hit diminishing marginal returns (DMR), the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at converting current-FLOPS to 2020-FLOPS-equivalent.
E.g. in (some smallish) parts of my view, you take observations like “AGI will use compute more efficiently than human brains” and can ask questions like “but how much is the efficiency of compute->cognition conversion increasing over time?” and draw that graph and try to extrapolate. Of course, the main trouble is in trying to estimate the original distribution of 2020-FLOPS-equivalent needed for TAI, which might go astray in the way that a 1950 estimate of the “watt-equivalents needed for TAI” would have gone astray.
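A minimal sketch of the bookkeeping this view implies (every number below is an illustrative placeholder; the 1e35 happens to match the figure discussed further down the thread, the rest are made up):

```python
# Placeholder assumptions, purely illustrative:
flops_needed_2020_equiv = 1e35  # 2020-FLOPS-equivalent assumed needed for TAI
halving_time_years = 2.5        # assumed years per 2x algorithmic-efficiency gain
physical_flop_2020 = 1e24       # assumed physical FLOP available for a large run
compute_growth_per_year = 2.0   # assumed growth factor in physical FLOP per year

for year in range(2020, 2081):
    t = year - 2020
    algorithmic_multiplier = 2 ** (t / halving_time_years)   # the 1:N ratio
    physical_flop = physical_flop_2020 * compute_growth_per_year ** t
    effective_2020_flop = physical_flop * algorithmic_multiplier
    if effective_2020_flop >= flops_needed_2020_equiv:
        print("threshold crossed in", year)
        break
```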
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
What is the meaning of “20XX-FLOPS convert to 2020-FLOPS-equivalent”? If 2020 algorithms hit DMR, you can’t match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs.
Maybe you’re talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??). However, in this case, how do you quantify the performance required for TAI? Do we have a “real-life Elo” for modern algorithms that we can compare to the human “real-life Elo”? Even if we did, this is not what Cotra is doing with her “neural anchor”.
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
I think 10^35 would probably be enough. This post gives some intuition as to why, and also goes into more detail about what 2020-flops-equivalent-that-TAI-needs means. If you want even more detail + rigor, see Ajeya’s report. If you think it’s very unlikely that 10^35 would be enough, I’d love to hear more about why—what are the blockers? Why would OmegaStar, SkunkWorks, etc. described in the post (and all the easily-accessible variants thereof) fail to be transformative? (Also, same questions for APS-AI or AI-PONR instead of TAI, since I don’t really care about TAI)
I didn’t ask how much, I asked what it even means. I think I understand the principles of Cotra’s report. What I don’t understand is why we should believe the “neural anchor” when (i) modern algorithms applied to a brain-sized ANN might not produce brain-performance and (ii) the compute cost of future algorithms might behave completely differently. (I.e., I don’t understand how Carl’s and Mark’s arguments in this thread protect the neural anchor from Yudkowsky’s criticism.)
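These are three separate things: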
(a) What is the meaning of “2020-FLOPS-equivalent that TAI needs?”
(b) Can you build TAI with 2020 algorithms without some truly astronomical amount of FLOPs?
(c) Why should we believe the “neural anchor?”
(a) is answered roughly in my linked post and in much more detail and rigor in Ajeya’s doc.
(b) depends on what you mean by truly astronomical; I think it would probably be doable for 10^35, Ajeya thinks 50% chance.
For (c), I actually don’t think we should put that much weight on the “neural anchor,” and I don’t think Ajeya’s framework requires that we do (although, it’s true, most of her anchors do center on this human-brain-sized ANN scenario which indeed I think we shouldn’t put so much weight on.) That said, I think it’s a reasonable anchor to use, even if it’s not where all of our weight should go. This post gives some of my intuitions about this. Of course Ajeya’s report says a lot more.
The chess link maybe should go to hippke’s work. What you can see there is that a fixed chess algorithm takes an exponentially growing amount of compute and transforms it into logarithmically-growing Elo. Similar behavior features in recent pessimistic predictions of deep learning’s future trajectory.
If general navigation of the real world suffers from this same logarithmic-or-worse penalty when translating hardware into performance metrics, then (perhaps surprisingly) we can’t conclude that hardware is the dominant driver of progress by noticing that the cost of compute is dropping rapidly.
But new algorithms also don’t work well on old hardware. That’s evidence in favor of Paul’s view that much software work is adapting to exploit new hardware scales.
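Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn’t have enough RAM to load it.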
I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude difference in what hardware is needed for AGI, and are much less predictable than advances in hardware rather than being adaptations in lockstep with it.
Here are the graphs from Hippke (he or I should publish a summary at some point, sorry).
I wanted to compare Fritz (which won the WCCC in 1995) to a modern engine to understand the effects of hardware and software performance. I think the time controls for that tournament are similar to SF STC. I wanted to compare to SF8 rather than one of the NNUE engines to isolate the effect of compute at development time and just look at test-time compute.
So having modern algorithms would have let you win the WCCC while spending about 50x less on compute than the winner. Having modern computer hardware would have let you win the WCCC while spending well over 1000x less on compute than the winner. Measured this way, software progress seems to be several times less important than hardware progress, despite a much faster scale-up of investment in software.
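A rough back-of-the-envelope on those two numbers, in log terms (treating 1000x as only a lower bound for the hardware factor, since it is “well over 1000x”):

$$\log_{10} 50 \approx 1.7 \ \text{OOMs from software}, \qquad \log_{10} 1000 = 3 \ \text{OOMs (a lower bound) from hardware}.$$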
But instead of asking “how well does hardware/software progress help you get to 1995 performance?” you could ask “how well does hardware/software progress get you to 2015 performance?” and on that metric it looks like software progress is way more important because you basically just can’t scale old algorithms up to modern performance.
The relevant measure varies depending on what you are asking. But from the perspective of takeoff speeds, it seems to me like one very salient takeaway is: if one chess project had literally come back in time with 20 years of chess progress, it would have allowed them to spend 50x less on compute than the leader.
ETA: but note that the ratio would be much more extreme for Deep Blue, which is another reasonable analogy you might use.
Yeah, the nonlinearity means it’s hard to know what question to ask.
If we just eyeball the graph and say that the Elo is log(log(compute)) + time (I’m totally ignoring constants here), and we assume that compute = e^t so that conveniently log(compute) = t, then d(Elo)/dt = 1/t + 1. The first term is from compute and the second from software. And so our history is totally not scale-free! There’s some natural timescale set by t = 1, before which chess progress was dominated by compute and after which chess progress will be (was?) dominated by software.
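Spelling out that toy calculation (constants still ignored):

$$\mathrm{Elo}(t) = \log\big(\log C(t)\big) + t, \qquad C(t) = e^{t} \;\Rightarrow\; \mathrm{Elo}(t) = \log t + t \;\Rightarrow\; \frac{d\,\mathrm{Elo}}{dt} = \frac{1}{t} + 1.$$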
Though maybe I shouldn’t spend so much time guessing at the phenomenology of chess, and different problems will have different scaling behavior :P I think this is the case for text models and things like the Winograd schema challenges.