I will have to look at these studies in detail in order to understand, but I’m confused about how this can pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with a not-crazy advantage in compute (or with any amount of compute with the same training data)? Can Eliza-style chatbots compete with GPT-3 however we scale them up?
My model is something like:
For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an “effective compute regime” where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns.
In the (relatively small) effective regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn’t get you much better performance than, say, an extra OOM of compute would (I have no idea if this is literally true; the example is just to convey the general gist).
One of the main ways that modern algorithms are better is that they have much larger effective compute regimes. The other main way is that they enable more effective conversion of compute to performance.
Therefore, one of the primary impacts of new algorithms is to enable performance to continue scaling with compute the way it did at smaller scales.
In this model, it makes sense to think of the “contribution” of new algorithms as the factor by which they enable more efficient conversion of compute to performance, and to count the increased performance that comes from new algorithms being able to absorb more compute as primarily hardware progress. I think the studies that Carl cites above are decent evidence that the multiplicative factor of compute → performance conversion you get from new algorithms is smaller than the historical growth in compute, so it further makes sense to claim that most progress came from compute, even though the algorithms were what “unlocked” the compute.
For an example that I think supports this model, see the LSTM versus transformer graphs in https://arxiv.org/pdf/2001.08361.pdf
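To make the shape of this concrete, here is a minimal toy sketch (my own made-up parameterization, not something taken from those studies): each algorithm converts log-compute to performance at some efficiency, but compute above its effective regime is mostly wasted; the new algorithm is a bit more efficient and, more importantly, saturates much later.

```python
import math

def performance(compute, efficiency, regime_cap):
    """Toy compute -> performance curve for a single algorithm.

    Performance grows with log10(compute), scaled by an efficiency factor,
    but compute above the algorithm's "effective compute regime" contributes
    only a small residual term (steep diminishing marginal returns).
    Units are arbitrary.
    """
    absorbed = min(compute, regime_cap)                    # compute the algorithm can actually use
    excess_ratio = max(compute, regime_cap) / regime_cap   # compute dumped in above the regime
    return efficiency * math.log10(absorbed) + 0.1 * math.log10(excess_ratio)

# Made-up parameters: the new algorithm converts compute a bit more efficiently,
# but its main advantage is a vastly larger effective compute regime.
old_algo = dict(efficiency=1.0, regime_cap=1e6)
new_algo = dict(efficiency=1.3, regime_cap=1e12)

for c in (1e4, 1e6, 1e9, 1e12):
    print(f"compute={c:.0e}  old={performance(c, **old_algo):.2f}  "
          f"new={performance(c, **new_algo):.2f}")
```

In this toy accounting, the new algorithm’s “contribution” is the 1.3 vs 1.0 efficiency factor, while the large gap that opens up at 1e9 compute and beyond is credited mainly to the extra compute the new algorithm can absorb.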
Hmm… interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for a given amount of compute relatively quickly after that compute becomes available. Moreover, there is some smooth relation between compute and performance, assuming the best algorithm for that level of compute. [EDIT: The latter part seems really suspicious though; why would this relation persist across very different algorithms?] Or at least this is true if “best algorithm” is interpreted to mean “best algorithm out of some wide class of algorithms s.t. we never or almost never managed to discover any algorithm outside of this class”.
This can justify biological anchors as upper bounds[1]: if biology is operating using the best algorithm then we will match its performance when we reach the same level of compute, whereas if biology is operating using a suboptimal algorithm then we will match its performance earlier. However, how do we define the compute used by biology? Moravec’s estimate is already in the past and there’s still no human-level AI. Then there is the “lifetime” anchor from Cotra’s report which predicts a very short timeline. Finally, there is the “evolution” anchor which predicts a relatively long timeline.
However, in Cotra’s report most of the weight is assigned to the “neural net” anchors, which talk about the compute needed to train an ANN of brain size using modern algorithms (plus there is the “genome” anchor, in which the ANN is genome-sized). This is something that I don’t see how to justify using Mark’s model. On Mark’s model, modern algorithms might very well hit diminishing returns soon, in which case we will switch to different algorithms which might have a completely different compute(parameter count) function.
[1] Assuming evolution also cannot discover algorithms outside our class of discoverable algorithms.
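For reference, the kind of arithmetic behind a neural-net anchor has roughly the following shape (every number below is an illustrative assumption of mine, not a figure from Cotra’s report):

```python
# Rough shape of a "neural net anchor" estimate.  All inputs are illustrative
# assumptions, not numbers from Cotra's report.
params = 1e14                      # hypothetical brain-scale parameter count
data_points = 20 * params          # assume data requirements scale roughly linearly with parameters
flop_per_data_point = 6 * params   # common dense-network rule of thumb: ~6 FLOP per param per example
horizon_factor = 1e3               # extra multiplier for long-horizon training (the key unknown)

training_flop = flop_per_data_point * data_points * horizon_factor
print(f"{training_flop:.0e}")      # ~1e33 with these made-up inputs
```

The worry in the comment above is exactly about the middle lines: how data and FLOP scale with parameter count is a property of current algorithms, and a successor paradigm could have a completely different compute(parameter count) function.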
What Moravec says is merely that $1k human-level compute will become available in the 2020s, and he offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don’t hit cheap human-compute until the end of the decade. He also doesn’t commit to how long it takes to turn compute into powerful systems; it’s more of a prerequisite: only once the compute is available can R&D really start, the same way that DL didn’t start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.
We already know how much compute we have, so we don’t need Moravec’s projections for this? If Yudkowsky described Moravec’s analysis correctly, then Moravec’s threshold was crossed in 2008. Or, by “other extrapolations”, do you mean other estimates of human brain compute? Cotra’s analysis is much more recent, and IIUC she puts the “lifetime anchor” (a more conservative approach than Moravec’s) at about one order of magnitude above the biggest models currently used (rough arithmetic sketched below).
Now, the seeds take time to sprout, but according to Mark’s model this time is quite short. So, it seems like this line of reasoning produces a timeline significantly shorter than the Plattian 30 years.
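For concreteness, the lifetime-anchor arithmetic referenced above goes roughly like this (both inputs are the usual order-of-magnitude guesses, not precise figures):

```python
brain_flop_per_s = 1e15          # common order-of-magnitude estimate of brain compute
seconds_to_adulthood = 1e9       # roughly 30 years of subjective experience
lifetime_anchor = brain_flop_per_s * seconds_to_adulthood   # ~1e24 FLOP

gpt3_training_flop = 3e23        # widely cited estimate for GPT-3's training run
print(lifetime_anchor / gpt3_training_flop)   # a few times the GPT-3 estimate, i.e. within about an OOM
```

Which is why, taken at face value, the lifetime anchor implies a strikingly short timeline.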
As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I’d like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.
I think this is uncharitable and most likely based on a misreading of Moravec. (And I’m generally with gwern on this one.)
As far as I can tell, the source for your attribution of this “prediction” is:
“If this rate of improvement were to continue into the next century, the 10 teraops required for a humanlike computer would be available in a $10 million supercomputer before 2010 and in a $1,000 personal computer by 2030.”
As far as I could tell, it sounds from the surrounding text like his “prediction” for transformative impacts from AI was something like “between 2010 and 2030” with broad error bars.
Adding to what Paul said: jacob_cannell points to this comment which claims that in Mind Children Moravec predicted human-level AGI in 2028.
Moravec, “Mind Children”, page 68: “Human equivalence in 40 years”. There he is actually talking about human-level intelligent machines arriving by 2028 - not just the hardware you would theoretically require to build one if you had the ten million dollars to spend on it.
I just went and skimmed Mind Children. He’s predicting human-equivalent computational power on a personal computer in 40 years. He seems to say that humans will within 50 years be surpassed in every important way by machines (page 70, below), but I haven’t found a more precise or short-term statement yet.
The robot who will work alongside us in half a century will have some interesting properties. Its reasoning abilities should be astonishingly better than a human’s—even today’s puny systems are much better in some areas. But its perceptual and motor abilities will probably be comparable to ours. Most interestingly, this artificial person will be highly changeable, both as an individual and from one of its generations to the next. But solitary, toiling robots, however competent, are only part of the story. Today, and for some decades into the future, the most effective computing machines work as tools in human hands. As the machinery grows in flexibility and initiative, this association between humans and machines will be more properly described as a partnership. In time, the relationship will become much more intimate, a symbiosis where the boundary between the “natural” and the “artificial” partner is no longer evident. This collaborative route is interesting for its powerful human consequences even if, as I believe, it will matter little in the long run whether or not humans are an intimate part of the evolving artificial intelligences.
Also, unimportant but cool: Check out his musing about the Fermi Paradox:
A frightening explanation is that the universe is prowled by stealthy wolves that prey on fledgling technological races. The only civilizations that survive long would be ones that avoid detection by staying very quiet. But wouldn’t the wolves be more technically advanced than their prey and if so what could they gain from their raids? Our autonomous-message idea suggests an odd answer: the wolves may be simply helpless bits of data that, in the absence of civilizations, can only lie dormant in multimillion-year trips between galaxies or even inscribed on rocks. Only when a newly evolved, country bumpkin of a technological civilization stumbles and naively acts on one does its eons-old sophistication and ruthlessness, honed over the bodies of countless past victims, become apparent. Then it engineers a reproductive orgy that kills its host and propagates astronomical numbers of copies of itself into the universe, each capable only of waiting patiently for another victim to arise. It is a strategy already familiar to us on a small scale, for it is used by the viruses that plague biological organisms.
While this theory is not nearly as good as the theory I prefer (life is hard, aliens are rare), it strikes me as comparably plausible to the Dark Forest theory. I wonder why I hadn’t heard of it before.
Those Fermi Paradox musings sound like the plot of A Fire Upon the Deep!
Actually, the premise of David Brin’s Existence is a close match to Moravec’s paragraph (not a coincidence, I bet, given that David hung around similar circles).
The way that you would think about NN anchors in my model (caveat that this isn’t my whole model):
You have some distribution over 2020-FLOPS-equivalent that TAI needs.
Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
The function from 20XX to the 1:N ratio is relatively predictable, e.g. a “smooth” exponential with respect to time.
Therefore, even though current algorithms will hit DMR (diminishing marginal returns), the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at converting current-FLOPS to 2020-FLOPS-equivalent.
E.g. in (some smallish) parts of my view, you take observations like “AGI will use compute more efficiently than human brains” and can ask questions like “but how much is the efficiency of compute->cognition increasing over time?” and draw that graph and try to extrapolate. Of course, the main trouble is in trying to estimate the original distribution of 2020-FLOPS-equivalent needed for TAI, which might go astray in the same way an estimate of the 1950-watt-equivalent needed for TAI would go astray.
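As a minimal sketch of how these pieces would combine into a forecast (every distribution and growth rate below is a placeholder I made up, not a number from Ajeya’s report): sample a 2020-FLOPS-equivalent requirement, then find the first year in which affordable physical FLOP times the algorithmic-progress conversion ratio exceeds it.

```python
import random

def median_forecast_year(n_samples=10_000, seed=0):
    """Monte Carlo sketch of an anchors-style forecast (placeholder numbers).

    Assumptions (all made up for illustration):
      - log10 of the 2020-FLOPS-equivalent needed for TAI ~ Normal(33, 3)
      - affordable physical FLOP for a large project: 1e24 in 2020, growing 10x per 4 years
      - the 20XX-FLOP -> 2020-FLOP-equivalent conversion ratio N doubles every 2.5 years
    """
    rng = random.Random(seed)
    years = []
    for _ in range(n_samples):
        log_requirement = rng.gauss(33, 3)       # 2020-FLOPS-equivalent TAI needs
        for year in range(2020, 2101):
            t = year - 2020
            log_physical = 24 + t / 4            # log10 of affordable physical FLOP
            log_conversion = (t / 2.5) * 0.301   # log10 of the 1:N ratio (0.301 = log10 of 2)
            if log_physical + log_conversion >= log_requirement:
                years.append(year)
                break
        else:
            years.append(2100)                   # censor samples that never cross by 2100
    years.sort()
    return years[len(years) // 2]

print(median_forecast_year())
```

The structural point is that the 1:N conversion ratio enters multiplicatively with physical compute, so as long as N grows at a predictable exponential rate it can be folded into the forecast even though each individual algorithm hits DMR along the way.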
I don’t understand this.
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
What is the meaning of “20XX-FLOPS convert to 2020-FLOPS-equivalent”? If 2020 algorithms hit DMR, you can’t match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs.
Maybe you’re talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??). However, in this case, how do you quantify the performance required for TAI? Do we have a “real-life Elo” for modern algorithms that we can compare to a human “real-life Elo”? Even if we did, this is not what Cotra is doing with her “neural anchor”.
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
I think 10^35 would probably be enough. This post gives some intuition as to why, and also goes into more detail about what 2020-flops-equivalent-that-TAI-needs means. If you want even more detail + rigor, see Ajeya’s report. If you think it’s very unlikely that 10^35 would be enough, I’d love to hear more about why—what are the blockers? Why would OmegaStar, SkunkWorks, etc. described in the post (and all the easily-accessible variants thereof) fail to be transformative? (Also, same questions for APS-AI or AI-PONR instead of TAI, since I don’t really care about TAI)
I didn’t ask how much, I asked what it even means. I think I understand the principles of Cotra’s report. What I don’t understand is why we should believe the “neural anchor” when (i) modern algorithms applied to a brain-sized ANN might not produce brain-level performance and (ii) the compute cost of future algorithms might behave completely differently. (I.e. I don’t understand how Carl’s and Mark’s arguments in this thread protect the neural anchor from Yudkowsky’s criticism.)
These are three separate things:
(a) What is the meaning of “2020-FLOPS-equivalent that TAI needs?”
(b) Can you build TAI with 2020 algorithms without some truly astronomical amount of FLOPs?
(c) Why should we believe the “neural anchor?”
(a) is answered roughly in my linked post and in much more detail and rigor in Ajeya’s doc.
(b) depends on what you mean by truly astronomical; I think it would probably be doable with 10^35 FLOP, and Ajeya thinks there’s a 50% chance of that.
For (c), I actually don’t think we should put that much weight on the “neural anchor,” and I don’t think Ajeya’s framework requires that we do (although, it’s true, most of her anchors do center on this human-brain-sized ANN scenario which indeed I think we shouldn’t put so much weight on.) That said, I think it’s a reasonable anchor to use, even if it’s not where all of our weight should go. This post gives some of my intuitions about this. Of course Ajeya’s report says a lot more.