Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive).
Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.
So if you’re going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards of tech forecasting), it’s best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it’s the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures).
Thinking about hardware has a lot of helpful implications for constraining timelines:
Evolutionary anchors, combined with paleontological and other information (if you’re worried about Rare Earth miracles), mostly cut off extremely high input estimates for AGI development, like Robin Hanson’s, and we can say from known human advantages relative to evolution that credence should be suppressed some distance short of that (moreso with more software progress)
You should have lower a priori credence in smaller-than-insect brains yielding AGI than in more middle-of-the-range compute budgets
It lets you see you should concentrate probability mass in the next decade or so because of the rapid scaleup of compute investment (with a supporting argument from the increased growth of AI R&D effort) covering a substantial share of the orders of magnitude between where we are and levels that we should expect are overkill
It gets you likely AGI this century, and on the closer part of that, with a pretty flat prior over orders of magnitude of inputs that will go into success
It suggests lower annual probability later on if Moore’s Law and friends are dead, with stagnant inputs to AI
These are all useful things highlighted by Ajeya’s model, and by earlier work like Moravec’s. In particular, I think Moravec’s forecasting methods are looking pretty good, given the difficulty of the problem. He and Kurzweil (like the computing industry generally) were surprised by the death of Dennard scaling and general price-performance of computing growth slowing, and we’re definitely years behind his forecasts in AI capability, but we are seeing a very compute-intensive AI boom in the right region of compute space. Moravec also did anticipate it would take a lot more compute than one lifetime run to get to AGI. He suggested human-level AGI would be in the vicinity of human-like compute quantities being cheap and available for R&D. This old discussion is flawed, but makes me feel the dialogue is straw-manning Moravec to some extent.
Ajeya’s model puts most of the modeling work on hardware, but it is intentionally expressive enough to let you represent a lot of different views about software research progress; you just have to contribute more of that yourself when adjusting weights on the different scenarios, or effective software contribution year by year. You can even represent a breakdown of the expectation that software and hardware significantly trade off over time, and very specific accounts of the AI software landscape and development paths. Regardless, modeling the most important changing input to AGI is useful, and I think this dialogue misleads with respect to that by equivocating between hardware not being the only contributing factor and hardware not being an extremely important, even dominant, driver of progress.
I commend this comment and concur with the importance of hardware, the straw-manning of Moravec, etc.
However I do think that EY had a few valid criticisms of Ajeya’s model in particular—it ends up smearing probability mass over many anchors or sub-models, most of which are arguably poorly grounded in deep engineering knowledge. And yes you can use it to create your own model, but most people won’t do that and are just looking at the default median conclusion.
Moore’s Law is petering out as we run up against the constraints of physics for practical irreversible computers, but the brain is also—at best—already at those same limits. So that should substantially reduce uncertainty concerning the hardware side (hardware parity now/soon), and thus place most of the uncertainty around software/algorithm iteration progress. The important algorithmic advances tend to change asymptotic scaling curvature rather than progress linearly, and really all the key uncertainty is over that—which I think is what EY is gesturing at, and rightly so.
Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.
So if you’re going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards of tech forecasting), it’s best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it’s the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures).
I don’t understand the logical leap from “human labor applied to AI didn’t grow much” to “we can ignore human labor”. The amount of labor invested in AI research is related to the time derivative of progress on the algorithms axis. Labor held constant is not the same as algorithms held constant. So, we are still talking about the problem of predicting when AI-capability(algorithms(t),compute(t)) reaches human level. What do you know about the function “AI-capability” that allows you to ignore its dependence on the 1st argument?
Or maybe you’re saying that algorithmic improvements have not been very important in practice? Surely such a claim is not compatible with e.g. the transitions from GOFAI to “shallow” ML to deep ML?
A perfectly correlated time series of compute and labor would not let us say which had the larger marginal contribution, but we have resources to get at that, which I was referring to with ‘plausible decompositions.’ This includes experiments with old and new software and hardware, like the chess ones Paul recently commissioned, and studies by AI Impacts, OpenAI, and Neil Thompson. There are AI scaling experiments, and observations of the results of shocks like the end of Dennard scaling, the availability of GPGPU computing, and Besiroglu’s data on the relative predictive power of compute and labor in individual papers and subfields.
In different ways those tend to put hardware as driving more log improvement than software (with both contributing), particularly if we consider software innovations downstream of hardware changes.
I will have to look at these studies in detail in order to understand, but I’m confused how this can pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with a not-crazy advantage in compute (or with any amount of compute with the same training data)? Can Eliza-style chatbots compete with GPT3 however we scale them up?
My model is something like:
For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an “effective compute regime” where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns.
In the (relatively small) regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn’t get you that much better performance than like an OOM of compute (I have no idea if this is true, example is more because it conveys the general gist).
One of the main ways that modern algorithms are better is that they have much larger effective compute regimes. The other main way is enabling more effective conversion of compute to performance.
Therefore, one of the primary impacts of new algorithms is to enable performance to continue scaling with compute the same way it did when you had smaller amounts.
In this model, it makes sense to think of the “contribution” of new algorithms as the factor by which they enable more efficient conversion of compute to performance, and to count the increased performance that comes because the new algorithms can absorb more compute as primarily hardware progress. I think the studies that Carl cites above are decent evidence that the multiplicative factor of compute → performance conversion you get from new algorithms is smaller than the historical growth in compute, so it further makes sense to claim that most progress came from compute, even though the algorithms were what “unlocked” the compute.
For an example of something I consider to support this model, see the LSTM versus transformer graphs in https://arxiv.org/pdf/2001.08361.pdf
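To make the shape of this model concrete, here is a toy numerical sketch (all the efficiency factors and regime caps below are made-up illustrative values, not numbers from the studies Carl cites):

```python
import math

def performance(compute, efficiency=1.0, regime_cap=1e6):
    """Toy performance curve: log returns to 'effective compute'
    (compute * efficiency), with steep diminishing returns once you
    go above the algorithm's effective compute regime."""
    effective = compute * efficiency
    if effective <= regime_cap:
        return math.log10(effective)
    # above the regime, extra compute helps only marginally
    return math.log10(regime_cap) + 0.05 * math.log10(effective / regime_cap)

# Hypothetical "old" vs "new" algorithm (parameters are assumptions):
old = dict(efficiency=1.0, regime_cap=1e6)    # e.g. an alpha-beta-like method
new = dict(efficiency=10.0, regime_cap=1e12)  # e.g. an AlphaGo-like method

for c in [1e4, 1e6, 1e12]:
    print(f"compute={c:.0e}  old={performance(c, **old):.2f}  new={performance(c, **new):.2f}")

# At small compute the two are close; at large compute the new algorithm
# pulls far ahead. Its direct efficiency multiplier is only 1 OOM, while
# compute grew 8 OOMs, so in this decomposition most of the log-performance
# gain gets attributed to hardware, even though the new algorithm is what
# "unlocked" the ability to absorb that hardware.
```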
Hmm… Interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for given compute relatively quickly after this compute becomes available. Moreover, there is some smooth relation between compute and performance assuming the best algorithm for this level of compute. [EDIT: The latter part seems really suspicious though, why would this relation persist across very different algorithms?] Or at least this is true if “best algorithm” is interpreted to mean “best algorithm out of some wide class of algorithms s.t. we never or almost never managed to discover any algorithm outside of this class”.
This can justify biological anchors as upper bounds[1]: if biology is operating using the best algorithm then we will match its performance when we reach the same level of compute, whereas if biology is operating using a suboptimal algorithm then we will match its performance earlier. However, how do we define the compute used by biology? Moravec’s estimate is already in the past and there’s still no human-level AI. Then there is the “lifetime” anchor from Cotra’s report which predicts a very short timeline. Finally, there is the “evolution” anchor which predicts a relatively long timeline.
However, in Cotra’s report most of the weight is assigned to the “neural net” anchors which talk about the compute for training an ANN of brain size using modern algorithms (plus there is the “genome” anchor in which the ANN is genome-sized). This is something that I don’t see how to justify using Mark’s model. On Mark’s model, modern algorithms might very well hit diminishing returns soon, in which case we will switch to different algorithms which might have a completely different compute(parameter count) function.
[1] Assuming evolution also cannot discover algorithms outside our class of discoverable algorithms.
What Moravec says is merely that $1k human-level compute will become available in the ’2020s’, and offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don’t hit cheap human-compute until the end of the decade. He also doesn’t commit to how long it takes to turn compute into powerful systems, it’s more of a pre-requisite: only once the compute is available can R&D really start, same way that DL didn’t start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.
We already know how much compute we have, so we don’t need Moravec’s projections for this? If Yudkowsky described Moravec’s analysis correctly, then Moravec’s threshold was crossed in 2008. Or, by “other extrapolations” you mean other estimates of human brain compute? Cotra’s analysis is much more recent and IIUC she puts the “lifetime anchor” (a more conservative approach than Moravec’s) at about one order of magnitude above the biggest models currently used.
Now, the seeds take time to sprout, but according to Mark’s model this time is quite short. So, it seems like this line of reasoning produces a timeline significantly shorter than the Plattian 30 years.
As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I’d like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.
I think this is uncharitable and most likely based on a misreading of Moravec. (And generally with gwern on this one.)
As far as I can tell, the source for your attribution of this “prediction” is:
“If this rate of improvement were to continue into the next century, the 10 teraops required for a humanlike computer would be available in a $10 million supercomputer before 2010 and in a $1,000 personal computer by 2030.”
As far as I could tell it sounds from the surrounding text like his “prediction” for transformative impacts from AI was something like “between 2010 and 2030” with broad error bars.
Adding to what Paul said: jacob_cannell points to this comment which claims that in Mind Children Moravec predicted human-level AGI in 2028.
Moravec, “Mind Children”, page 68: “Human equivalence in 40 years”. There he is actually talking about human-level intelligent machines arriving by 2028 - not just the hardware you would theoretically require to build one if you had the ten million dollars to spend on it.
I just went and skimmed Mind Children. He’s predicting human-equivalent computational power on a personal computer in 40 years. He seems to say that humans will within 50 years be surpassed in every important way by machines (page 70, below), but I haven’t found a more precise or short-term statement yet.
The robot who will work alongside us in half a century will have some interesting properties. Its reasoning abilities should be astonishingly better than a human’s—even today’s puny systems are much better in some areas. But its perceptual and motor abilities will probably be comparable to ours. Most interestingly, this artificial person will be highly changeable, both as an individual and from one of its generations to the next. But solitary, toiling robots, however competent, are only part of the story. Today, and for some decades into the future, the most effective computing machines work as tools in human hands. As the machinery grows in flexibility and initiative, this association between humans and machines will be more properly described as a partnership. In time, the relationship will become much more intimate, a symbiosis where the boundary between the “natural” and the “artificial” partner is no longer evident. This collaborative route is interesting for its powerful human consequences even if, as I believe, it will matter little in the long run whether or not humans are an intimate part of the evolving artificial intelligences.
Also, unimportant but cool: Check out his musing about the Fermi Paradox:
A frightening explanation is that the universe is prowled by stealthy wolves that prey on fledgling technological races. The only civilizations that survive long would be ones that avoid detection by staying very quiet. But wouldn’t the wolves be more technically advanced than their prey and if so what could they gain from their raids? Our autonomous-message idea suggests an odd answer. The wolves may be simply helpless bits of data that, in the absence of civilizations, can only lie dormant in multimillion-year trips between galaxies or even inscribed on rocks. Only when a newly evolved, country bumpkin of a technological civilization stumbles and naively acts on one does its eons-old sophistication and ruthlessness, honed over the bodies of countless past victims, become apparent. Then it engineers a reproductive orgy that kills its host and propagates astronomical numbers of copies of itself into the universe, each capable only of waiting patiently for another victim to arise. It is a strategy already familiar to us on a small scale, for it is used by the viruses that plague biological organisms.
While this theory is not nearly as good as the theory I prefer (life is hard, aliens are rare) it strikes me as comparably plausible to the Dark Forest theory. I wonder why I hadn’t heard of it before.
Those Fermi Paradox musings sound like the plot of A Fire Upon the Deep!
actually, the premise of david brin’s existence is a close match to moravec’s paragraph (not a coincidence, i bet, given that david hung around similar circles).
The way that you would think about NN anchors in my model (caveat that this isn’t my whole model):
You have some distribution over 2020-FLOPS-equivalent that TAI needs.
Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
The function from 20XX to the 1:N ratio is relatively predictable, e.g. a “smooth” exponential with respect to time.
Therefore, even though current algorithms will hit DMR, the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at converting current-FLOPS to 2020-FLOPS-equivalent.
E.g. in (some smallish) parts of my view, you take observations like “AGI will use compute more efficiently than human brains” and can ask questions like “but how much is the efficiency of compute->cognition increasing over time?” and draw that graph and try to extrapolate. Of course, the main trouble is in trying to estimate the original distribution of 2020-FLOPS-equivalent needed for TAI, which might go astray in the way a 1950-watt-equivalent needed for TAI will go astray.
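A minimal numerical sketch of that structure (the 2.5-year algorithmic halving time and the hypothetical 1e30-FLOP training run are assumptions for illustration only, and the 10^35 bar is borrowed from Daniel’s comment further down; none of this is Ajeya’s actual model):

```python
import math

HALVING_YEARS = 2.5          # assumed: algorithmic progress halves required physical FLOPs every 2.5 years
NEEDED_2020_FLOPS = 1e35     # assumed median of the "2020-FLOPS-equivalent needed for TAI" distribution

def flops_equivalent_ratio(year):
    """The 1:N ratio at which physical year-`year` FLOPs convert into
    2020-FLOPS-equivalent, assumed to grow as a smooth exponential in time."""
    return 2.0 ** ((year - 2020) / HALVING_YEARS)

def effective_flops(physical_flops, year):
    """Physical FLOPs spent in `year`, expressed in 2020-FLOPS-equivalent."""
    return physical_flops * flops_equivalent_ratio(year)

# Example: if the largest training run in 2035 used 1e30 physical FLOPs
# (an assumption, not a forecast), how close would it be to the assumed bar?
run = effective_flops(1e30, 2035)
print(f"2035 run ~ 1e{math.log10(run):.1f} 2020-FLOPS-equivalent "
      f"(bar assumed at 1e{math.log10(NEEDED_2020_FLOPS):.0f})")
```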
I don’t understand this.
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
What is the meaning of “20XX-FLOPS convert to 2020-FLOPS-equivalent”? If 2020 algorithms hit DMR, you can’t match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs.
Maybe you’re talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??) However, in this case, how do you quantify the performance required for TAI? Do we have “real life elo” for modern algorithms that we can compare to human “real life elo”? Even if we did, this is not what Cotra is doing with her “neural anchor”.
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
I think 10^35 would probably be enough. This post gives some intuition as to why, and also goes into more detail about what 2020-flops-equivalent-that-TAI-needs means. If you want even more detail + rigor, see Ajeya’s report. If you think it’s very unlikely that 10^35 would be enough, I’d love to hear more about why—what are the blockers? Why would OmegaStar, SkunkWorks, etc. described in the post (and all the easily-accessible variants thereof) fail to be transformative? (Also, same questions for APS-AI or AI-PONR instead of TAI, since I don’t really care about TAI)
I didn’t ask how much, I asked what it even means. I think I understand the principles of Cotra’s report. What I don’t understand is why we should believe the “neural anchor” when (i) modern algorithms applied to a brain-sized ANN might not produce brain-performance and (ii) the compute cost of future algorithms might behave completely differently. (i.e. I don’t understand how Carl’s and Mark’s arguments in this thread protect the neural anchor from Yudkowsky’s criticism.)
These are three separate things:
(a) What is the meaning of “2020-FLOPS-equivalent that TAI needs?”
(b) Can you build TAI with 2020 algorithms without some truly astronomical amount of FLOPs?
(c) Why should we believe the “neural anchor?”
(a) is answered roughly in my linked post and in much more detail and rigor in Ajeya’s doc.
(b) depends on what you mean by truly astronomical; I think it would probably be doable for 10^35, Ajeya thinks 50% chance.
For (c), I actually don’t think we should put that much weight on the “neural anchor,” and I don’t think Ajeya’s framework requires that we do (although, it’s true, most of her anchors do center on this human-brain-sized ANN scenario which indeed I think we shouldn’t put so much weight on.) That said, I think it’s a reasonable anchor to use, even if it’s not where all of our weight should go. This post gives some of my intuitions about this. Of course Ajeya’s report says a lot more.
The chess link maybe should go to hippke’s work. What you can see there is that a fixed chess algorithm takes an exponentially growing amount of compute and transforms it into logarithmically-growing Elo. Similar behavior features in recent pessimistic predictions of deep learning’s future trajectory.
If general navigation of the real world suffers from this same logarithmic-or-worse penalty when translating hardware into performance metrics, then (perhaps surprisingly) we can’t conclude that hardware is the dominant driver of progress by noticing that the cost of compute is dropping rapidly.
But new algorithms also don’t work well on old hardware. That’s evidence in favor of Paul’s view that much software work is adapting to exploit new hardware scales.
Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn’t have enough RAM to load it.
I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude difference in what hardware is needed for AGI, and are much less predictable than advances in hardware rather than being adaptations in lockstep with it.
Here are the graphs from Hippke (he or I should publish a summary at some point, sorry).
I wanted to compare Fritz (which won WCCC in 1995) to a modern engine to understand the effects of hardware and software performance. I think the time controls for that tournament are similar to SF STC. I wanted to compare to SF8 rather than one of the NNUE engines to isolate out the effect of compute at development time and just look at test-time compute.
So having modern algorithms would have let you win WCCC while spending about 50x less on compute than the winner. Having modern computer hardware would have let you win WCCC spending way more than 1000x less on compute than the winner. Measured this way software progress seems to be several times less important than hardware progress despite much faster scale-up of investment in software.
But instead of asking “how well does hardware/software progress help you get to 1995 performance?” you could ask “how well does hardware/software progress get you to 2015 performance?” and on that metric it looks like software progress is way more important because you basically just can’t scale old algorithms up to modern performance.
The relevant measure varies depending on what you are asking. But from the perspective of takeoff speeds, it seems to me like one very salient takeaway is: if one chess project had literally come back in time with 20 years of chess progress, it would have allowed them to spend 50x less on compute than the leader.
ETA: but note that the ratio would be much more extreme for Deep Blue, which is another reasonable analogy you might use.
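For concreteness, the rough per-year rates implied by the numbers above (the 50x software figure is from the comparison above; the hardware factor is only bounded below by “way more than 1000x”, so the 10^4 used here is an assumed illustrative value):

```python
import math

software_factor = 50     # from the comparison above: modern algorithms vs Fritz
hardware_factor = 1e4    # assumption; the comment only says "way more than 1000x"
years = 20               # roughly 1995 -> 2015

print(f"software: {math.log10(software_factor):.1f} OOMs over {years} years "
      f"(~{software_factor ** (1 / years):.2f}x per year)")
print(f"hardware: {math.log10(hardware_factor):.1f} OOMs over {years} years "
      f"(~{hardware_factor ** (1 / years):.2f}x per year)")
```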
Yeah, the nonlinearity means it’s hard to know what question to ask.
If we just eyeball the graph and say that the Elo is log(log(compute)) + time (I’m totally ignoring constants here), and we assume that compute = e^t so that conveniently log(compute) = t, then dElo/dt = 1/t + 1. The first term is from compute and the second from software. And so our history is totally not scale-free! There’s some natural timescale set by t = 1, before which chess progress was dominated by compute and after which chess progress will be (was?) dominated by software.
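Spelling out that back-of-the-envelope derivation (same simplifications as above: constants ignored, compute growing as e^t):

```latex
\[
\mathrm{Elo}(t) = \log\big(\log(\mathrm{compute}(t))\big) + t,
\qquad \mathrm{compute}(t) = e^{t}
\;\Longrightarrow\;
\mathrm{Elo}(t) = \log t + t,
\qquad
\frac{d\,\mathrm{Elo}}{dt} = \frac{1}{t} + 1 .
\]
```

The 1/t term (compute) dominates before t = 1 and the constant term (software) dominates after it, which is the natural timescale mentioned above.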
Though maybe I shouldn’t spend so much time guessing at the phenomenology of chess, and different problems will have different scaling behavior :P I think this is the case for text models and things like the Winograd schema challenges.
(I’m trying to answer and clarify some of the points in the comments based on my interpretation of Yudkowsky in this post. So take the interpretations with a grain of salt, not as “exactly what Yudkowsky meant”)
Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive).
Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.
So if you’re going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards of tech forecasting), it’s best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it’s the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures).
My summary of what you’re defending here: because hardware progress is (according to you) the major driver of AI innovation, we should invest a lot of our forecasting resources into forecasting it, and we should leverage it as the strongest source of evidence available for thinking about AGI timelines.
I feel like this is not in contradiction with what Yudkowsky wrote in this post? I doubt he agrees that just additional compute is the main driver of progress (after all, the Bitter Lesson mostly tells you that insights and innovations leveraging more compute will beat hardcoded ones), but insofar as he expects us to have next to no knowledge of how to build AGI until around 2 years before it is done (and then only for those with the Thielian secret), then compute is indeed the next best thing that we have to estimate timelines.
Yet Yudkowsky’s point is that being the next best thing doesn’t mean it’s any good.
Thinking about hardware has a lot of helpful implications for constraining timelines:
Evolutionary anchors, combined with paleontological and other information (if you’re worried about Rare Earth miracles), mostly cut off extremely high input estimates for AGI development, like Robin Hanson’s, and we can say from known human advantages relative to evolution that credence should be suppressed some distance short of that (moreso with more software progress)
Evolution being an upper bound makes sense, and I think Yudkowsky agrees. But it’s an upper bound on the whole human optimization process, and the search space of human optimization is tricky to think about. I see many of Yudkowsky’s criticisms of biological estimates here as saying “this biological anchor doesn’t express the cost of evolution’s optimization in terms of human optimization, but instead goes for a proxy which doesn’t tell you anything”.
So if someone captured both evolution and human optimization in the same search space, and found an upper bound on the cost (in terms of optimization power) that evolution spent to find humans, then I expect Yudkowsky would agree that this is an upper bound for the optimization power that humans will use. But he might still retort that translating optimization power into compute is not obvious.
You should have lower a priori credence in smaller-than-insect brains yielding AGI than in more middle-of-the-range compute budgets
Okay, I’m going to propose what I think is the chain of arguments you’re using here:
Currently, we can train what sounds like the compute equivalent of insect brains, and yet we don’t have AGI. Hence we’re not currently able to build AGI with “smaller-than-insect brains”, which means AGI is less likely to be created with “smaller-than-insect brains”.
I agree that we don’t have AGI
The “compute equivalent” stuff is difficult, as I mentioned above, but I don’t think this is the main issue here.
Going from “we don’t know how to do that now” to “we should expect that it is not how we will do it” doesn’t really work IMO. As Yudkowsky points out, the requirements for AGI are constantly dropping, and maybe a new insight will turn out to make smaller neural nets far more powerful, before the bigger models reach AGI
Evolution created insect-sized brains and they were clearly not AGI, so we have evidence against AGI with that amount of resources.
Here the fact that evolution is far worse an optimizer than humans breaks most of the connection between evolution creating insects and humans creating AGI. Evolution merely shows that insects can be made with insect-sized brains, not that AGI cannot be extracted by better use of the same resources.
From my perspective this is exactly what Yudkowsky is arguing against in this post: just because you know of a bunch of paths through the search space doesn’t mean you know what a cleverer optimizer could find. There are ways to use a bunch of paths as data to understand the search space, but you then need either to argue that they are somehow dense in the search space, or that the sort of paths you’re interested in look similar to this bunch of paths. And at the moment, I don’t see an argument of either of these forms.
By default we should expect AGI to have a decent minimal size because of its complexity, hence smaller models have a lower credence.
Agree with the principle (sounds improbable that AGI will be made in 10 lines of LISP), but the threshold is where most of the difficulty lies: how much is too little? A hundred neurons sounds clearly too small, but when you reach insect-sized brains, it’s not obvious (at least to me) that better use of resources couldn’t bring you most of the way to AGI.
(I wonder if there’s an availability bias here where the only good models we have nowadays are huge, hence we expect that AGI must be a huge model?)
It lets you see you should concentrate probability mass in the next decade or so because of the rapid scaleup of compute investment (with a supporting argument from the increased growth of AI R&D effort) covering a substantial share of the orders of magnitude between where we are and levels that we should expect are overkill
I think this is where the crux of whether the current paradigm can just scale matters a lot. The main point Yudkowsky uses in the dialogue to argue against your concentration of probability mass is that he doesn’t agree that deep learning scales that way to AGI. In his view (on which I’m not clear yet, and that’s not a view that I’ve seen anyone who actually studies LMs have), the increase in performance will break down before that point. And as such, the concentration of probability mass shouldn’t happen, because the fact that you can reach the anchor is irrelevant since we don’t know a way to turn compute into AGI (according to Yudkowsky’s view).
It gets you likely AGI this century, and on the closer part of that, with a pretty flat prior over orders of magnitude of inputs that will go into success
Here too, it depends on transforming the optimization power of evolution into compute and other requirements, and then knowing how this compute is supposed to get transformed into efficiency and AGI. (That being said, I think Yudkowsky agrees with the conclusion, just not that specific way of reaching it).
It suggests lower annual probability later on if Moore’s Law and friends are dead, with stagnant inputs to AI
Not clear to me what you mean here (might be clearer with the right link to the section of Cotra’s report about this). But note that based on Yudkowsky’s model in this post, the cost to make AGI should continue to drop as long as the world doesn’t end, which creates a weird situation where the probability of AGI keeps increasing with time (Not sure how to turn that into a distribution though...)
These are all useful things highlighted by Ajeya’s model, and by earlier work like Moravec’s. In particular, I think Moravec’s forecasting methods are looking pretty good, given the difficulty of the problem. He and Kurzweil (like the computing industry generally) were surprised by the death of Dennard scaling and general price-performance of computing growth slowing, and we’re definitely years behind his forecasts in AI capability, but we are seeing a very compute-intensive AI boom in the right region of compute space. Moravec also did anticipate it would take a lot more compute than one lifetime run to get to AGI. He suggested human-level AGI would be in the vicinity of human-like compute quantities being cheap and available for R&D. This old discussion is flawed, but makes me feel the dialogue is straw-manning Moravec to some extent.
This is in the same spirit as a bunch of comments on this post, and I feel like it’s missing the point of the post? Like, it’s not about Moravec’s estimate being wildly wrong, it’s about the unsoundness of the methods by which Moravec reaches his conclusion. Your analysis doesn’t give enough evidence of Moravec’s predictive accuracy that we should expect he has a really strong method that just looks bad to Yudkowsky but is actually sound. And I feel points like that don’t go at all after the cruxes (the soundness of the method); instead they mostly correct a “too harsh judgment” by Yudkowsky, without invalidating his points.
Ajeya’s model puts most of the modeling work on hardware, but it is intentionally expressive enough to let you represent a lot of different views about software research progress; you just have to contribute more of that yourself when adjusting weights on the different scenarios, or effective software contribution year by year. You can even represent a breakdown of the expectation that software and hardware significantly trade off over time, and very specific accounts of the AI software landscape and development paths. Regardless, modeling the most important changing input to AGI is useful, and I think this dialogue misleads with respect to that by equivocating between hardware not being the only contributing factor and hardware not being an extremely important, even dominant, driver of progress.
Hmm, my impression here is that Yudkowsky is actually arguing that he is modeling AGI timelines that way; and if you don’t add unwarranted assumptions and don’t misuse the analogies to biological anchors, then you get his model, which is completely unable to give the sort of answer Cotra’s model is outputting.
Or said differently, I expect that Yudkowsky thinks that if you reason correctly and only use actual evidence instead of unsound lines of reasoning, you get his model; but doing that in the explicit context of biological anchors is like trying to quit sugar in a sweetshop: the whole setting just makes that far harder. And given that he expects to get the right constraints on models without the biological anchors stuff, it’s completely redundant AND unhelpful.
Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive).
Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.
So if you’re going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards tech forecasting), it’s best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it’s the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures).
Thinking about hardware has a lot of helpful implications for constraining timelines:
Evolutionary anchors, combined with paleontological and other information (if you’re worried about Rare Earth miracles), mostly cut off extremely high input estimates for AGI development, like Robin Hanson’s, and we can say from known human advantages relative to evolution that credence should be suppressed some distance short of that (moreso with more software progress)
You should have lower a priori credence in smaller-than-insect brains yielding AGI than more middle of the range compute budgets
It lets you see you should concentrate probability mass in the next decade or so because of the rapid scaleup of compute investment (with a supporting argument from the increased growth of AI R&D effort) covering a substantial share of the orders of magnitude between where we are and levels that we should expect are overkill
It gets you likely AGI this century, and on the closer part of that, with a pretty flat prior over orders of magnitude of inputs that will go into success of magnitude of inputs
It suggests lower annual probability later on if Moore’s Law and friends are dead, with stagnant inputs to AI
These are all useful things highlighted by Ajeya’s model, and by earlier work like Moravec’s. In particular, I think Moravec’s forecasting methods are looking pretty good, given the difficulty of the problem. He and Kurzweil (like the computing industry generally) were surprised by the death of Dennard scaling and general price-performance of computing growth slowing, and we’re definitely years behind his forecasts in AI capability, but we are seeing a very compute-intensive AI boom in the right region of compute space. Moravec also did anticipate it would take a lot more compute than one lifetime run to get to AGI. He suggested human-level AGI would be in the vicinity of human-like compute quantities being cheap and available for R&D. This old discussion is flawed, but makes me feel the dialogue is straw-manning Moravec to some extent.
Ajeya’s model puts most of the modeling work on hardware, but it is intentionally expressive enough to let you represent a lot of different views about software research progress, you just have to contribute more of that yourself when adjusting weights on the different scenarios, or effective software contribution year by year. You can even represent a breakdown of the expectation that software and hardware significantly trade off over time, and very specific accounts of the AI software landscape and development paths. Regardless modeling the most importantly changing input to AGI is useful, and I think this dialogue misleads with respect to that by equivocating between hardware not being the only contributing factor and not being an extremely important to dominant driver of progress.
I commend this comment and concur with the importance of hardware, the straw-manning of Moravec, etc.
However I do think that EY had a few valid criticisms of Ajeya’s model in particular—it ends up smearing probability mass over many anchors or sub-models, most of which are arguably poorly grounded in deep engineering knowledge. And yes you can use it to create your own model, but most people won’t do that and are just looking at the default median conclusion.
Moore’s Law is petering out as we run up against the constraints of physics for practical irreversible computers, but the brain is also—at best—already at those same limits. So that should substantially reduce uncertainty concerning the hardware side (hardware parity now/soon), and thus place most of the uncertainty around software/algorithm iteration progress. The important algorithmic advances tend to change asymptotic scaling curvature rather than progress linearly, and really all the key uncertainty is over that—which I think is what EY is gesturing at, and rightly so.
I don’t understand the logical leap from “human labor applied to AI didn’t grow much” to “we can ignore human labor”. The amount of labor invested in AI research is related to the the time derivative of progress on the algorithms axis. Labor held constant is not the same as algorithms held constant. So, we are still talking about the problem of predicting when AI-capability(algorithms(t),compute(t)) reaches human level. What do you know about the function “AI-capability” that allows you to ignore its dependence on the 1st argument?
Or maybe you’re saying that algorithmic improvements have not been very important in practice? Surely such a claim is not compatible with e.g. the transitions from GOFAI to “shallow” ML to deep ML?
A perfectly correlated time series of compute and labor would not let us say which had the larger marginal contribution, but we have resources to get at that, which I was referring to with ‘plausible decompositions.’ This includes experiments with old and new software and hardware, like the chess ones Paul recently commissioned, and studies by AI Impacts, OpenAI, and Neil Thompson. There are AI scaling experiments, and observations of the results of shocks like the end of Dennard scaling, the availability of GPGPU computing, and Besiroglu’s data on the relative predictive power of computer and labor in individual papers and subfields.
In different ways those tend to put hardware as driving more log improvement than software (with both contributing), particularly if we consider software innovations downstream of hardware changes.
I will have to look at these studies in detail in order to understand, but I’m confused how can this pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with not-crazy advantage in compute (or with any amount of compute with the same training data)? Can Eliza-style chatbots compete with GPT3 however we scale them up?
My model is something like:
For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an “effective compute regime” where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns.
In the (relatively small) regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn’t get you that much better performance than like an OOM of compute (I have no idea if this is true, example is more because it conveys the general gist).
One of the main way that modern algorithms are better is that they have much large effective compute regimes. The other main way is enabling more effective conversion of compute to performance.
Therefore, one of primary impact of new algorithms is to enable performance to continue scaling with compute the same way it did when you had smaller amounts.
In this model, it makes sense to think of the “contribution” of new algorithms as the factor they enable more efficient conversion of compute to performance and count the increased performance because the new algorithms can absorb more compute as primarily hardware progress. I think the studies that Carl cites above are decent evidence that the multiplicative factor of compute → performance conversion you get from new algorithms is smaller than the historical growth in compute, so it further makes sense to claim that most progress came from compute, even though the algorithms were what “unlocked” the compute.
For an example of something I consider supports this model, see the LSTM versus transformer graphs in https://arxiv.org/pdf/2001.08361.pdf
Hmm… Interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for given compute relatively quickly after this compute becomes available. Moreover, there is some smooth relation between compute and performance assuming the best algorithm for this level of compute. [EDIT: The latter part seems really suspicious though, why would this relation persist across very different algorithms?] Or at least this is true is “best algorithm” is interpreted to mean “best algorithm out of some wide class of algorithms s.t. we never or almost never managed to discover any algorithm outside of this class”.
This can justify biological anchors as upper bounds[1]: if biology is operating using the best algorithm then we will match its performance when we reach the same level of compute, whereas if biology is operating using a suboptimal algorithm then we will match its performance earlier. However, how do we define the compute used by biology? Moravec’s estimate is already in the past and there’s still no human-level AI. Then there is the “lifetime” anchor from Cotra’s report which predicts a very short timeline. Finally, there is the “evolution” anchor which predicts a relatively long timeline.
However, in Cotra’s report most of the weight is assigned to the “neural net” anchors which talk about the compute for training an ANN of brain size using modern algorithms (plus there is the “genome” anchor in which the ANN is genome-sized). This is something that I don’t see how to justify using Mark’s model. On Mark’s model, modern algorithms might very well hit diminishing returns soon, in which case we will switch to different algorithms which might have a completely different compute(parameter count) function.
Assuming evolution also cannot discover algorithms outside our class of discoverable algorithms.
What Moravec says is merely that $1k human-level compute will become available in the ’2020s’, and offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don’t hit cheap human-compute until the end of the decade. He also doesn’t commit to how long it takes to turn compute into powerful systems, it’s more of a pre-requisite: only once the compute is available can R&D really start, same way that DL didn’t start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.
We already know how much compute we have, so we don’t need Moravec’s projections for this? If Yudkowsky described Moravec’s analysis correctly, then Moravec’s threshold was crossed in 2008. Or, by “other extrapolations” you mean other estimates of human brain compute? Cotra’s analysis is much more recent and IIUC she puts the “lifetime anchor” (a more conservative approach than Moravec’s) at about one order of magnitude above the biggest models currently used.
Now, the seeds take time to sprout, but according to Mark’s model this time is quite short. So, it seems like this line of reasoning produces a timeline significantly shorter than the Plattian 30 years.
As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I’d like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.
I think this is uncharitable and most likely based on a misreading of Moravec. (And generally with gwern on this one.)
As far as I can tell, the source for your attribution of this “prediction” is:
As far as I could tell it sounds from the surrounding text like his “prediction” for transformative impacts from AI was something like “between 2010 and 2030″ with broad error bars.
Adding to what Paul said: jacob_cannell points to this comment which claims that in Mind Children Moravec predicted human-level AGI in 2028.
I just went and skimmed Mind Children. He’s predicting human-equivalent computational power on a personal computer in 40 years. He seems to say that humans will within 50 years be surpassed in every important way by machines (page 70, below), but I haven’t found a more precise or short-term statement yet.
Also, unimportant but cool: Check out his musing about the Fermi Paradox:
While this theory is not nearly as good as the theory I prefer (life is hard, aliens are rare) it strikes me as comparably plausible to the Dark Forest theory. I wonder why I hadn’t heard of it before.
Those Fermi Paradox musings sound like the plot of A Fire Upon the Deep!
actually, the premise of david brin’s existence is a close match to moravec’s paragraph (not a coincidence, i bet, given that david hung around similar circles).
The way that you would think about NN anchors in my model (caveat that this isn’t my whole model):
You have some distribution over 2020-FLOPS-equivalent that TAI needs.
Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
The function from 20XX to the 1:N ratio is relatively predictable, e.g. a “smooth” exponential with respect to time.
Therefore, even though current algorithms will hit DMR, the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at converting current-FLOPS to 2020-FLOPS-equivalent.
E.g. in (some smallish) parts of my view, you take observations like “AGI will use compute more efficiently than human brains” and can ask questions like “but how much is the efficiency of compute->cognition increasing over time?” and draw that graph and try to extrapolate. Of course, the main trouble is in trying to estimate the original distribution of 2020-FLOPS-equivalent needed for TAI, which might go astray in the way a 1950-watt-equivalent needed for TAI will go astray.
I don’t understand this.
What is the meaning of “2020-FLOPS-equivalent that TAI needs”? Plausibly you can’t build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
What is the meaning of “20XX-FLOPS convert to 2020-FLOPS-equivalent”? If 2020 algorithms hit DMR, you can’t match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs.
Maybe you’re talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??) However, in this case, how do you quantify the performance required for TAI? Do we have “real life elo” for modern algorithms that we can compare to human “real life elo”? Even if we did, this is not what Cotra is doing with her “neural anchor”.
I think 10^35 would probably be enough. This post gives some intuition as to why, and also goes into more detail about what 2020-flops-equivalent-that-TAI-needs means. If you want even more detail + rigor, see Ajeya’s report. If you think it’s very unlikely that 10^35 would be enough, I’d love to hear more about why—what are the blockers? Why would OmegaStar, SkunkWorks, etc. described in the post (and all the easily-accessible variants thereof) fail to be transformative? (Also, same questions for APS-AI or AI-PONR instead of TAI, since I don’t really care about TAI)
I didn’t ask how much, I asked what does it even mean. I think I understand the principles of Cotra’s report. What I don’t understand is why should we believe the “neural anchor” when (i) modern algorithms applied to a brain-sized ANN might not produce brain-performance and (ii) the compute cost of future algorithms might behave completely differently. (i.e. I don’t understand how Carl’s and Mark’s arguments in this thread protect the neural anchor from Yudkowsky’s criticism.)
These are three separate things:
(a) What is the meaning of “2020-FLOPS-equivalent that TAI needs?”
(b) Can you build TAI with 2020 algorithms without some truly astronomical amount of FLOPs?
(c) Why should we believe the “neural anchor?”
(a) is answered roughly in my linked post and in much more detail and rigor in Ajeya’s doc.
(b) depends on what you mean by truly astronomical; I think it would probably be doable for 10^35, Ajeya thinks 50% chance.
For (c), I actually don’t think we should put that much weight on the “neural anchor,” and I don’t think Ajeya’s framework requires that we do (although, it’s true, most of her anchors do center on this human-brain-sized ANN scenario which indeed I think we shouldn’t put so much weight on.) That said, I think it’s a reasonable anchor to use, even if it’s not where all of our weight should go. This post gives some of my intuitions about this. Of course Ajeya’s report says a lot more.
The chess link maybe should go to hippke’s work. What you can see there is that a fixed chess algorithm takes an exponentially growing amount of compute and transforms it into logarithmically-growing Elo. Similar behavior features in recent pessimistic predictions of deep learning’s future trajectory.
If general navigation of the real world suffers from this same logarithmic-or-worse penalty when translating hardware into performance metrics, then (perhaps surprisingly) we can’t conclude that hardware is the dominant driver of progress by noticing that the cost of compute is dropping rapidly.
But new algorithms also don’t work well on old hardware. That’s evidence in favor of Paul’s view that much software work is adapting to exploit new hardware scales.
Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn’t have enough RAM to load it.
I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude difference in what hardware is needed for AGI, and are much less predictable than advances in hardware rather than being adaptations in lockstep with it.
Here are the graphs from Hippke (he or I should publish summary at some point, sorry).
I wanted to compare Fritz (which won WCCC in 1995) to a modern engine to understand the effects of hardware and software performance. I think the time controls for that tournament are similar to SF STC I think. I wanted to compare to SF8 rather than one of the NNUE engines to isolate out the effect of compute at development time and just look at test-time compute.
So having modern algorithms would have let you win WCCC while spending about 50x less on compute than the winner. Having modern computer hardware would have let you win WCCC spending way more than 1000x less on compute than the winner. Measured this way software progress seems to be several times less important than hardware progress despite much faster scale-up of investment in software.
But instead of asking “how well does hardware/software progress help you get to 1995 performance?” you could ask “how well does hardware/software progress get you to 2015 performance?” and on that metric it looks like software progress is way more important because you basically just can’t scale old algorithms up to modern performance.
The relevant measure varies depending on what you are asking. But from the perspective of takeoff speeds, it seems to me like one very salient takeaway is: if one chess project had literally come back in time with 20 years of chess progress, it would have allowed them to spend 50x less on compute than the leader.
ETA: but note that the ratio would be much more extreme for Deep Blue, which is another reasonable analogy you might use.
Yeah, the nonlinearity means it’s hard to know what question to ask.
If we just eyeball the graph and say that the Elo is log(log(compute)) + time (I’m totally ignoring constants here), and we assume that compute = et so that conveniently log(compute)=t, thenddtElo=1t+1 . The first term is from compute and the second from software. And so our history is totally not scale-free! There’s some natural timescale set by t=1, before which chess progress was dominated by compute and after which chess progress will be (was?) dominated by software.
Though maybe I shouldn’t spend so much time guessing at the phenomenology of chess, and different problems will have different scaling behavior :P I think this is the case for text models and things like the Winograd schema challenges.
(I’m trying to answer and clarify some of the points in the comments based on my interpretation of Yudkowsky in this post. So take the interpretations with a grain of salt, not as “exactly what Yudkowsky meant”)
My summary of what you’re defending here: because hardware progress is (according to you) the major driver of AI innovation, we should invest a lot of our forecasting resources into forecasting it, and we should leverage it as the strongest source of evidence available for thinking about AGI timelines.
I feel like this is not in contradiction with what Yudkowsky wrote in this post? I doubt he agrees that additional compute alone is the main driver of progress (after all, the Bitter Lesson mostly tells you that insights and innovations leveraging more compute will beat hardcoded ones), but insofar as he expects us to have next to no knowledge of how to build AGI until around 2 years before it is done (and then only for those with the Thielian secret), compute is indeed the next best thing we have for estimating timelines.
Yet Yudkowsky’s point is that being the next best thing doesn’t mean it’s any good.
Evolution being an upper bound makes sense, and I think Yudkowsky agrees. But it’s an upper bound on the whole human optimization process, and the search space of human optimization is tricky to think about. I see many of Yudkowsky’s criticisms of biological estimates here as saying “this biological anchor doesn’t express the cost of evolution’s optimization in terms of human optimization, but instead goes for a proxy which doesn’t tell you anything”.
So if someone captured both evolution and human optimization in the same search space, and found an upper bound on the cost (in terms of optimization power) that evolution spent to find humans, then I expect Yudkowsky would agree that this is an upper bound for the optimization power that humans will use. But he might still retort that translating optimization power into compute is not obvious.
Okay, I’m going to propose what I think is the chain of arguments you’re using here:
Currently, we can train what sounds like the compute equivalent of insect brains, and yet we don’t have AGI. Hence we’re not currently able to build AGI with “smaller-than-insect brains”, which means AGI is less likely to be created with “smaller-than-insect brains”.
I agree that we don’t have AGI
The “compute equivalent” stuff is difficult, as I mentioned above, but I don’t think this is the main issue here.
Going from “we don’t know how to do that now” to “we should expect that it is not how we will do it” doesn’t really work IMO. As Yudkowsky points out, the requirements for AGI are constantly dropping, and maybe a new insight will turn out to make smaller neural nets far more powerful before the bigger models reach AGI.
Evolution created insect-sized brains and they were clearly not AGI, so we have evidence against AGI with that amount of resources.
Here the fact that evolution is far worse an optimizer than humans breaks most of the connection between evolution creating insects and humans creating AGI. Evolution merely shows that insects can be made with insect-sized brains, not that AGI cannot be extracted by better use of the same resources.
From my perspective this is exactly what Yudkowsky is arguing against in this post: knowing a bunch of paths through the search space doesn’t tell you what a cleverer optimizer could find. There are ways to use a bunch of known paths as data for understanding the search space, but you then need to argue either that they are somehow dense in the search space, or that the sort of paths you’re interested in looks similar to the ones you know. And at the moment, I don’t see an argument of either form.
By default we should expect AGI to have a decent minimal size because of its complexity, hence smaller models have a lower credence.
Agree with the principle (it sounds improbable that AGI will be made in 10 lines of LISP), but the threshold is where most of the difficulty lies: how much is too little? A hundred neurons sounds clearly too small, but when you reach insect-sized brains, it’s not obvious (at least to me) that better use of resources couldn’t bring you most of the way to AGI.
(I wonder if there’s an availability bias here where the only good models we have nowadays are huge, hence we expect that AGI must be a huge model?)
I think this is where the crux of “can the current paradigm just scale?” matters a lot. The main point Yudkowsky uses in the dialogue to argue against your concentration of probability mass is that he doesn’t agree that deep learning scales that way to AGI. In his view (which I’m not yet clear on, and which I haven’t seen anyone who actually studies LMs hold), the increase in performance will break down before then. And as such, the concentration of probability mass shouldn’t happen, because the fact that you can reach the anchor is irrelevant if we don’t know a way to turn compute into AGI (according to Yudkowsky’s view).
Here too, it depends on transforming the optimization power of evolution into compute and other requirements, and then knowing how this compute is supposed to get transformed into efficiency and AGI. (That being said, I think Yudkowsky agrees with the conclusion, just not with that specific way of reaching it.)
Not clear to me what you mean here (it might be clearer with the right link to the section of Cotra’s report about this). But note that based on Yudkowsky’s model in this post, the cost to make AGI should continue to drop as long as the world doesn’t end, which creates a weird situation where the probability of AGI keeps increasing with time (not sure how to turn that into a distribution, though...).
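For the parenthetical: if we read this as the per-year chance of AGI rising as costs fall, one standard way to turn that into a distribution is to treat the per-year chance as a hazard rate. A minimal sketch, with a made-up hazard schedule purely for illustration:

```python
# Minimal sketch: turn an increasing per-year probability (hazard rate) into a
# timeline distribution. The hazard values are invented for illustration only.
import numpy as np

years = np.arange(2025, 2101)
hazard = np.linspace(0.02, 0.10, len(years))             # assumed: 2%/yr rising to 10%/yr

p_by_year = 1 - np.cumprod(1 - hazard)                   # P(AGI by year t)
p_in_year = np.diff(np.concatenate(([0.0], p_by_year)))  # implied probability mass per year

print(f"P(AGI by 2050) ~ {p_by_year[years == 2050][0]:.2f}")
print(f"modal year under this schedule: {years[np.argmax(p_in_year)]}")
```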
This is in the same spirit as a bunch of comments on this post, and I feel like it’s missing the point of the post? Like, it’s not about Moravec’s estimate being wildly wrong, it’s about the unsoundness of the methods by which Moravec reached his conclusion. Your analysis doesn’t give such strong evidence for Moravec’s predictive accuracy that we should expect he had a really strong method that just looks bad to Yudkowsky but is actually sound. And I feel points like that don’t engage at all with the crux (the soundness of the method); instead they mostly correct a “too harsh judgment” by Yudkowsky, without invalidating his points.
Hmm, my impression here is that Yudkowsky is actually arguing that he is modeling AGI timelines that way; and that if you don’t add unwarranted assumptions and don’t misuse the analogies to biological anchors, then you get his model, which is completely unable to give the sort of answer Cotra’s model is outputting.
Or said differently, I expect that Yudkowsky thinks that if you reason correctly and only use actual evidence instead of unsound lines of reasoning, you get his model; but doing that in the explicit context of biological anchors is like trying to quit sugar in a sweetshop: the whole setting just makes it far harder. And given that he expects to get the right constraints on models without the biological anchors stuff, that stuff is completely redundant AND unhelpful.