Software engineer and repeat startup founder; best known for Writely (aka Google Docs). Now starting https://www.aisoup.org to foster constructive expert conversations about open questions in AI and AI policy, and posting at https://amistrongeryet.substack.com and https://x.com/snewmanpv.
Sure, but for output quality better than what humans could (ever) do to matter for the relative speed up, you have to argue about compute bottlenecks, not Amdahl’s law for just the automation itself!
I’m having trouble parsing this sentence… which may not be important – the rest of what you’ve said seems clear, so unless there’s a separate idea here that needs responding to then it’s fine.
It sounds like your actual objection is in the human-only, software-only time from superhuman coder to SAR (you think this would take more than 1.5-10 years).
Or perhaps your objection is that you think there will be a smaller AI R&D multiplier for superhuman coders. (But this isn’t relevant once you hit full automation!)
Agreed that these two statements do a fairly good job of characterizing my objection. I think the discussion is somewhat confused by the term “AI researcher”. Presumably, for an SAR to accelerate R&D by 25x, “AI researcher” needs to cover nearly all human activities that go into AI R&D? And even more so for SAIR/250x. While I’ve never worked at an AI lab, I presume that the full set of activities involved in producing better models is pretty broad, with tails extending into domains pretty far from the subject matter of an ML Ph.D and sometimes carried out by people whose job titles and career paths bear no resemblance to “AI researcher”. Is that a fair statement?
If “producing better models” (AI R&D) requires more than just narrow “AI research” skills, then either SAR and SAIR need to be defined to cover that broader skill set (in which case, yes, I’d argue that 1.5-10 years is unreasonably short for unaccelerated SC->SAR), or if we stick with narrower definitions for SAR and SAIR then, yes, I’d argue for smaller multipliers.
This is valid for activities which benefit from speed and scale. But when output quality is paramount, speed and scale may not always provide much help?
My mental model is that, for some time to come, there will be activities where AIs simply aren’t very competent at all, such that even many copies running at high speed won’t provide uplift. For instance, if AIs aren’t in general able to make good choices regarding which experiments to run next, then even an army of very fast poor-experiment-choosers might not be worth much; we might still need to rely on people to choose experiments. Or if AIs aren’t much good at evaluating strategic business plans, it might be hard to train AIs to be better at running a business (a component of the SAIR → ASI transition) without relying on human input for that task.
For Amdahl’s Law purposes, I’ve been shorthanding “incompetent AIs that don’t become useful for a task even when taking speed + scale into account” as “AI doesn’t provide uplift for that task”.
EDIT: of course, in practice it’s generally at least somewhat possible to trade speed+scale for quality, e.g. using consensus algorithms, or generate-and-test if you have a good way of identifying the best output. So a further refinement is to say that very high acceleration requires us to assume that this does not reach importantly diminishing returns in a significant set of activities.
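As a toy illustration of the generate-and-test version of that trade (nothing specific to AI R&D): the sketch below uses a hypothetical generate function and scoring function of my own invention; the trick is only as good as the scorer, which is exactly where the diminishing-returns concern bites.

```python
import random

def generate_and_test(generate, score, n):
    """Trade scale for quality: sample n candidate outputs and keep the one
    the scorer likes best. Useful only insofar as the scorer can actually
    identify the best output."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with stand-in functions: outputs are just numbers, and the
# scorer simply prefers larger values. Real tasks need a real scorer.
best = generate_and_test(lambda: random.gauss(0, 1), lambda x: x, n=64)
print(best)
```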
EDIT2:
(My sense is that the progress multipliers in AI 2027 are too high but also that the human-only times between milestones are somewhat too long. On net, this makes me expect somewhat slower takeoff with a substantial chance on much slower takeoff.)
I find this quite plausible.
Yes, but you’re assuming that human-driven AI R&D is very highly bottlenecked on a single, highly serial task, which is simply not the case. (If you disagree: which specific narrow activity are you referring to that constitutes the non-parallelizable bottleneck?)
Amdahl’s Law isn’t just a bit of math, it’s a bit of math coupled with long experience of how complex systems tend to decompose in practice.
That’s not how the math works. Suppose there are 200 activities under the heading of “AI R&D” that each comprise at least 0.1% of the workload. Suppose we reach a point where AI is vastly superhuman at 150 of those activities (which would include any activities that humans are particularly bad at), moderately superhuman at 40 more, and not much better than human (or even worse than human) at the remaining 10. Those 10 activities where AI is not providing much uplift comprise at least 1% of the AI R&D workload, and so progress can be accelerated at most 100x.
This is oversimplified; there is some room for superhuman ability in one area (making excellent choices of experiments to run) to compensate for lack of uplift in another (time to code and execute individual experiments). But the fundamental point remains: a complex process can be bottlenecked by its slowest step. Amdahl’s Law is not symmetric: a chain is only as strong as its weakest link, no matter how strong its strongest.
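To make the arithmetic concrete, here is the standard Amdahl’s Law calculation applied to that toy decomposition. The workload shares assigned to the three groups below are hypothetical numbers of my own choosing, purely for illustration:

```python
def amdahl_speedup(shares_and_speedups):
    """Overall speedup when each share of the workload is accelerated by
    its own factor: 1 / sum(share_i / speedup_i)."""
    return 1.0 / sum(share / speedup for share, speedup in shares_and_speedups)

# Hypothetical split of the 200 activities (illustrative numbers only):
# 150 activities where AI is vastly superhuman     -> say 74% of workload at 100x
# 40 activities where AI is moderately superhuman  -> say 25% of workload at 5x
# 10 activities with little or no uplift           -> at least 1% of workload at ~1x
print(amdahl_speedup([(0.74, 100), (0.25, 5), (0.01, 1)]))  # ~14.8x

# Even if everything else were infinitely fast, the 1% with no uplift
# caps overall acceleration at 1 / 0.01 = 100x.
print(1 / 0.01)  # 100.0
```

Even with generous numbers for the accelerated activities, the small un-uplifted slice dominates the bound.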
AI 2027 is a Bet Against Amdahl’s Law
You omitted “with a straight face”. I do not believe that the scenario you’ve described is plausible (in the timeframe where we don’t already have ASI by other means, i.e. as a path to ASI rather than a ramification of it).
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Just posting to express my appreciation for the rich discussion. I see two broad topics emerging that seem worthy of systematic exploration:
What does a world look like in which AI is accelerating the productivity of a team of knowledge workers by 2x? 10x? 50x? In each scenario, how does the team interact with the AIs, what capabilities do the AIs need, and what strengths do the humans need? How do junior and senior team members fit into this transition? For what sorts of work would this work well / poorly?
Validate this model against current practice, e.g. the ratio of junior vs. senior staff in effective organizations and how work is distributed across seniority.
How does this play out specifically for AI R&D?
Revisiting the questions from item 1.
How does increased R&D team productivity affect progress: to what extent is compute a bottleneck, how could the R&D organization adjust activities in response to reduced cost of labor relative to compute, does this open an opportunity to explore more points in the architecture space, etc.
(This is just a very brief sketch of the questions to be explored.)
I’m planning to review the entire discussion here and try to distill it into an early exploration of these questions, which I’ll then post, probably later this month.
better capabilities than average adult human in almost all respects in late 2024
I see people say things like this, but I don’t understand it at all. The average adult human can do all sorts of things that current AIs are hopeless at, such as planning a weekend getaway. Have you, literally you personally today, automated 90% of the things you do at your computer? If current AI has better capabilities than the average adult human, shouldn’t it be able to do most of what you do? (Setting aside anything where you have special expertise, but we all spend big chunks of our day doing things where we don’t have special expertise – replying to routine emails, for instance.)
FWIW, I touched on this in a recent blog post: https://amistrongeryet.substack.com/p/speed-and-distance.
Thanks for engaging so deeply on this!
AIs don’t just substitute for human researchers, they can specialize differently. Suppose (for simplicity) there are 2 roughly equally good lines of research that can substitute (e.g. they create some fungible algorithmic progress) and capability researchers currently do 50% of each. Further, suppose that AIs can 30x accelerate the first line of research, but are worthless for the second. This could yield >10x acceleration via researchers just focusing on the first line of research (depending on how diminishing returns go).
Good point, this would have some impact.
As an intution pump, imagine you had nearly free junior hires who run 10x faster, but also work all hours. Because they are free, you can run tons of copies. I think this could pretty plausibly speed things up by 10x.
Wouldn’t you drown in the overhead of generating tasks, evaluating the results, etc.? As a senior dev, I’ve had plenty of situations where junior devs were very helpful, but I’ve also had plenty of situations where it was more work for me to manage them than it would have been to do the job myself. These weren’t incompetent people, they just didn’t understand the situation well enough to make good choices and it wasn’t easy to impart that understanding. And I don’t think I’ve ever been sole tech lead for a team that was overall more than, say, 5x more productive than I am on my own – even when many of the people on the team were quite senior themselves. I can’t imagine trying to farm out enough work to achieve 10x of my personal productivity. There’s only so much you can delegate unless the system you’re delegating to has the sort of taste, judgement, and contextual awareness that a junior hire more or less by definition does not. Also you might run into the issue I mentioned where the senior person in the center of all this is no longer getting their hands dirty enough to collect the input needed to drive their high-level intuition and do their high-value senior things.
Hmm, I suppose it’s possible that AI R&D has a different flavor than what I’m used to. The software projects I’ve spent my career on are usually not very experimental in nature; the goal is generally not to learn whether an idea shows promise, it’s to design and write code that implements a feature spec, for integration into the production system. If a junior dev does a so-so job, I have to work with them to bring it up to a higher standard, because we don’t want to incur the tech debt of integrating so-so code; we’d be paying for it for years. Maybe that plays out differently in AI R&D?
Incidentally, in this scenario, do you actually get to 10x the productivity of all your staff? Or do you just get to fire your junior staff? Seems like that depends on the distribution of staff levels today and on whether, in this world, junior staff can step up and productively manage AIs themselves.
Suppose a company specifically trained an AI system to be very familiar with its code base and infrastructure and relatively good at doing experiments for it. Then, it seems plausible that (with some misc schlep) the only needed context would be project specific context. …
These are fascinating questions but beyond what I think I can usefully contribute to in the format of a discussion thread. I might reach out at some point to see whether you’re open to discussing further. Ultimately I’m interested in developing a somewhat detailed model, with well-identified variables / assumptions that can be tested against reality.
I see a bunch of good questions explicitly or implicitly posed here. I’ll touch on each one.
1. What level of capabilities would be needed to achieve “AIs that 10x AI R&D labor”? My guess is, pretty high. Obviously you’d need to be able to automate at least 90% of what capabilities researchers do today. But 90% is a lot; you’ll be pushing out into the long tail of tasks that require taste, subtle tacit knowledge, etc. I am handicapped here by having absolutely no experience with / exposure to what goes on inside an AI research lab. I have 35 years of experience as a software engineer but precisely zero experience working on AI. So on this question I somewhat defer to folks like you. But I do suspect there is a tendency to underestimate how difficult / annoying these tail effects will be; this is the same fundamental principle as Hofstadter’s Law, the Programmer’s Credo, etc.
I have a personal suspicion that a surprisingly large fraction of work (possibly but not necessarily limited to “knowledge work”) will turn out to be “AGI complete”, meaning that it will require something approaching full AGI to undertake it at human level. But I haven’t really developed this idea beyond an intuition. It’s a crux and I would like to find a way to develop / articulate it further.
2. What does it even mean to accelerate someone’s work by 10x? It may be that if your experts are no longer doing any grunt work, they are no longer getting the input they need to do the parts of their job that are hardest to automate and/or where they’re really adding magic-sauce value. Or there may be other sources of friction / loss. In some cases it may over time be possible to find adaptations, in other cases it may be a more fundamental issue. (A possible counterbalance: if AIs can become highly superhuman at some aspects of the job, not just in speed/cost but in quality of output, that could compensate for delivering a less-than-10x time speedup on the overall workflow.)
3. If AIs that 10x AI R&D labor are 20% likely to arrive and be adopted by Jan 2027, would that update my view on the possibility of AGI-as-I-defined-it by 2030? It would, because (per the above) I think that delivering that 10x productivity boost would require something pretty close to AGI. In other words, conditional on AI R&D labor being accelerated 10x by Jan 2027, I would expect that we have something close to AGI by Jan 2027, which also implies that we were able to make huge advances in capabilities in 24 months. Whereas I think your model is that we could get that level of productivity boost from something well short of AGI.
If it turns out that we get 10x AI R&D labor by Jan 2027 but the AIs that enabled this are pretty far from AGI… then my world model is very confused and I can’t predict how I would update, I’d need to know more about how that worked out. I suppose it would probably push me toward shorter timelines, because it would suggest that “almost all work is easy” and RSI starts to really kick in earlier than my expectation.
4. Is this 10x milestone achievable just by scaling up existing approaches? My intuition is no. I think that milestone requires very capable AI (items 1+2 in this list). And I don’t see current approaches delivering much progress on things I think will be needed for such capability, such as long-term memory, continuous learning, ability to “break out of the chatbox” and deal with open-ended information sources and extraneous information, or other factors that I mentioned in the original post.
I am very interested in discussing any or all of these questions further.
See my response to Daniel (https://www.lesswrong.com/posts/auGYErf5QqiTihTsJ/what-indicators-should-we-watch-to-disambiguate-agi?commentId=WRJMsp2bZCBp5egvr). In brief: I won’t defend my vague characterization of “breakthroughs”, nor my handwavy estimates of how many are needed to reach AGI, how often they occur, and how the rate of breakthroughs might evolve. I would love to see someone attempt a more rigorous analysis along these lines (I don’t feel particularly qualified to do so). I wouldn’t expect that to result in a precise figure for the arrival of AGI, but I would hope for it to add to the conversation.
This is my “slow scenario”. Not sure whether it’s clear that I meant the things I said here to lean pessimistic – I struggled with whether to clutter each scenario with a lot of “might” and “if things go quickly / slowly” and so forth.
In any case, you are absolutely correct that I am handwaving here, independent of whether I am attempting to wave in the general direction of my median prediction or something else. The same is true in other places, for instance when I argue that even in what I am dubbing a “fast scenario” AGI (as defined here) is at least four years away. Perhaps I should have added additional qualifiers in the handful of places where I mention specific calendar timelines.
What I am primarily hoping to contribute is a focus on specific(ish) qualitative changes that (I argue) will need to emerge in AI capabilities along the path to AGI. A lot of the discourse seems to treat capabilities as a scalar, one-dimensional variable, with the implication that we can project timelines by measuring the rate of increase in that variable. At this point I don’t think that’s the best framing, or at least not the only useful framing.
One hope I have is that others can step in and help construct better-grounded estimates on things I’m gesturing at, such as how many “breakthroughs” (a term I have notably not attempted to define) would be needed to reach AGI and how many we might expect per year. But I’d be satisfied if my only contribution would be that people start talking a bit less about benchmark scores and a bit more about the indicators I list toward the end of the post – or, even better, some improved set of indicators.
What Indicators Should We Watch to Disambiguate AGI Timelines?
Yes, test time compute can be worthwhile to scale. My argument is that it is less worthwhile than scaling training compute. We should expect to see scaling of test time compute, but (I suggest) we shouldn’t expect this scaling to go as far as it has for training compute, and we should expect it to be employed sparingly.
The main reason I think this is worth bringing up is that people have been talking about test-time compute as “the new scaling law”, with the implication that it will pick up right where scaling of training compute left off: just keep turning the dial and you’ll keep getting better results. I think the idea that there is no wall, and that everything will continue just as before except that the compute scaling now happens on the inference side, is exaggerated.
Jumping in late just to say one thing very directly: I believe you are correct to be skeptical of the framing that inference compute introduces a “new scaling law”. Yes, we now have two ways of using more compute to get better performance – at training time or at inference time. But (as you’re presumably thinking) training compute can be amortized across all occasions when the model is used, while inference compute cannot, which means it won’t be worthwhile to go very far down the road of scaling inference compute.
We will continue to increase inference compute, for problems that are difficult enough to call for it, and more so as efficiency gains reduce the cost. But given the log-linear nature of the scaling law, and the inability to amortize, I don’t think we’ll see the many-order-of-magnitude journey that we’ve seen for training compute.
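A toy cost comparison may make the amortization asymmetry concrete. All of the numbers below are hypothetical, chosen only to show the structure of the argument: extra training compute is paid once and spread over every future query, while extra inference compute is paid again on every query.

```python
# Toy illustration of the amortization asymmetry (all numbers hypothetical).
train_cost_multiplier = 10        # 10x more training compute, paid once
inference_cost_multiplier = 10    # 10x more inference compute, paid per query
base_train_cost = 1e6             # arbitrary units
base_cost_per_query = 1.0
num_queries = 1e9

scaled_training = (train_cost_multiplier * base_train_cost
                   + num_queries * base_cost_per_query)
scaled_inference = (base_train_cost
                    + num_queries * inference_cost_multiplier * base_cost_per_query)

print(scaled_training / num_queries)   # ~1.01 per query
print(scaled_inference / num_queries)  # ~10.0 per query
```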
As others have said, what we should presumably expect from o4, o5, etc. is that they’ll make better use of a given amount of compute (and/or be able to throw compute at a broader range of problems), not that they’ll primarily be about pushing farther up that log-linear graph.
Of course in the domain of natural intelligence, it is sometimes worth having a person go off and spend a full day on a problem, or even have a large team spend several years on a high-level problem. In other words, to spend lots of inference-time compute on a single high-level task. I have not tried to wrap my head around how that relates to scaling of inference-time compute. Is the relationship between the performance of a team on a task, and the number of person-days the team has to spend, log-linear???
I’d participate.
I love this. Strong upvoted. I wonder if there’s a “silent majority” of folks who would tend to post (and upvote) reasonable things, but don’t bother because “everyone knows there’s no point in trying to have a civil discussion on Twitter”.
Might there be a bit of a collective action problem here? Like, we need a critical mass of reasonable people participating in the discussion so that reasonable participation gets engagement and thus the reasonable people are motivated to continue? I wonder what might be done about that.
I think we’re saying the same thing? “The LLM being given less information [about the internal state of the actor it is imitating]” and “the LLM needs to maintain a probability distribution over possible internal states of the actor it is imitating” seem pretty equivalent.
This is valid, but doesn’t really engage with the specific arguments here. By definition, when we consider the potential for AI to accelerate the path to ASI, we are contemplating the capabilities of something that is not a full ASI. Today’s models have extremely jagged capabilities, with lots of holes, and (I would argue) they aren’t anywhere near exhibiting sophisticated high-level planning skills able to route around their own limitations. So the question becomes, what is the shape of the curve of AI filling in weak capabilities and/or developing sophisticated strategies for routing around those weaknesses?
This is exactly missing the point. Training a cutting-edge model today involves a broad range of activities, not all of which fall under the heading of “discovering technologies” or “improving algorithms” or whatever. I am arguing that if all you can do is find better algorithms rapidly, that’s valuable but it’s not going to speed up overall progress by very large factors. Also, it may be that “by training very small models very quickly”, the AI would discover new technologies that improve some aspects of models but fail to advance some other important aspects.