Situational Awareness Summarized—Part 2
This is the second post in the Situational Awareness Summarized sequence. Collectively, these posts represent my attempt to condense Leopold Aschenbrenner’s recent report, Situational Awareness, into something more digestible.
Part II: Zoom to Foom
This section outlines how the development of AGI quickly and inevitably leads to superintelligence.
Leopold points out that a massive increase in AI research progress doesn’t require that we figure out how to automate everything—just that we figure out how to automate AI research. Building on the “drop-in remote worker” concept from Part 1, Leopold argues that this is plausible—and that it scales. With a few more years of growth in our compute capacity, we’d be able to run as many as 100 million “human-equivalent” researchers, and possibly speed them up as well. Part II predicts that automated researchers could give us a decade’s worth of algorithmic progress (5+ OOMs of effective compute) in a single year. If this happens, it likely translates to extremely rapid progress in AI capabilities—and, shortly thereafter, in everything else.
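For concreteness, here’s a rough back-of-envelope in the spirit of that claim. The token rates below are round numbers I picked myself, not figures from the report:

```python
# Illustrative back-of-envelope (my round numbers, not Leopold's): how a large
# inference fleet translates into a count of "human-equivalent" researchers.
human_tokens_per_sec = 10                                    # assumed human "thinking speed" in tokens
human_tokens_per_workday = human_tokens_per_sec * 8 * 3600   # ~288k tokens per 8-hour day

fleet_tokens_per_sec = 1e9                                   # assumed aggregate inference throughput
fleet_tokens_per_day = fleet_tokens_per_sec * 24 * 3600      # ~8.6e13 tokens per day

equivalent_researchers = fleet_tokens_per_day / human_tokens_per_workday
print(f"{equivalent_researchers:,.0f} human-equivalent researchers")  # ~300,000,000 at these assumptions
```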
Supporting this case, Leopold outlines some of the advantages that automated researchers might have over human researchers. In particular, he highlights the increased memory and competency of AGI-level thinkers, the ability to copy and replicate after “onboarding”, and the ability to rapidly turn new machine learning efficiencies into faster and better thinking (since they run on the same software they’re researching).
Potential Bottlenecks
He also describes some potential bottlenecks to this process, and how they might be overcome. Possible bottlenecks include:
Limited compute available for experiments
Complementarities and long tails—the hardest fraction of work will govern speed
Fundamental limits to algorithmic efficiencies
Other sources of diminishing returns to research
Compute for experiments
Leopold considers this the most important bottleneck. Pure cognitive labor alone can’t make rapid progress; you also need to run experiments. Some of these experiments will take vast quantities of compute. Leopold suggests some ways that automated researchers might work around this problem:
Running small tests
Running a few large tests with huge gains
Increasing effective compute by compounding efficiencies
Spending lots of researcher-time to avoid bugs
Having better intuitions overall (by being smarter and reading all prior work)
In an aside, Leopold also addresses a possible counterargument: why hasn’t the large number of academic researchers already increased progress at labs? He argues that automated researchers would be very different from academics, and in particular would have access to the newest models being actively developed.
Complementarities and long tails
A valuable lesson from economics is that it is often difficult to fully automate anything, because many kinds of work are complementary, and the work that is hardest to automate becomes the new bottleneck. Leopold broadly accepts this model, and believes that it will slow down the automated research pipeline by a few years, but no more than that.
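To see why the hardest-to-automate slice governs the overall pace, here’s a minimal Amdahl’s-law-style sketch (my framing, not the report’s):

```python
# Amdahl's-law-style bound: if a fraction f of the research pipeline cannot be sped
# up, that fraction caps the overall acceleration no matter how fast the rest runs.
def overall_speedup(f_unautomated: float, automated_speedup: float) -> float:
    return 1.0 / (f_unautomated + (1.0 - f_unautomated) / automated_speedup)

for f in (0.5, 0.2, 0.05):
    print(f"unautomated fraction {f:.0%}: "
          f"{overall_speedup(f, 100):.1f}x overall, even with 100x on everything else")
```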
Leopold also points out that progress in capabilities may be uneven. Models might be superhuman at coding before they are human-level at research. This suggests we would see a sudden jump once models passed the last few hurdles to full automation, since they would already be superhuman in some domains.
Fundamental limits to algorithmic efficiencies
Models of the same size won’t keep getting smarter forever; eventually we will run out of new techniques like chain-of-thought prompting that improve performance at a given model size. Will this slow progress? Leopold thinks it will eventually, but argues that there is probably still low-hanging fruit to pluck. He estimates another 5 OOMs of efficiency might be achievable in the next decade (and that this might be achieved much faster by automated researchers).
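The arithmetic behind that estimate is simple once you average the decade figure out to a per-year rate:

```python
# Spelling out "another 5 OOMs in a decade": the implied average rate compounds to a
# ~100,000x reduction in the compute needed for a fixed level of capability.
ooms_per_year = 0.5                      # the decade estimate averaged out, not a separate claim
years = 10
total_ooms = ooms_per_year * years       # 5.0
effective_multiplier = 10 ** total_ooms  # 100,000
print(f"{total_ooms} OOMs -> {effective_multiplier:,.0f}x less compute for the same capability")
```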
Diminishing returns
Technological progress tends to slow over time, as new ideas get harder to find. How will this affect AI research in the next decade? Leopold makes the case that the sudden jump in researcher output from automation will (temporarily) overcome the usual growth-slowing effect of diminishing returns. In particular, he argues that it would be extremely unlikely for ideas to suddenly get much harder to find at exactly the same rate as automation unlocks much faster research. His bottom line: we’ll run out of good ideas eventually, but we’ll probably get to superintelligence first.
Implications of Superintelligence
Leopold argues that the minds output by the above process will be quantitatively and qualitatively superhuman. Billions of AIs that read and think much faster than humans, able to come up with ideas that no human has ever considered, will rapidly eclipse human capabilities. Leopold predicts these minds will:
Automate all cognitive work,
Solve robotics (which Leopold considers mostly an ML algorithms problem today),
Rapidly develop new technologies,
Cause an explosion of industrial and economic growth,
Provide an overwhelming military advantage, and
Become capable of overthrowing major governments.
Based on the advancements above, Leopold predicts a volatile and dangerous period in the coming decade during which humanity may lose control of our own future.
Some questions I have after reading
I still feel like “a decade of progress in one year” is an oddly specific prediction. Why not a century? A millennium? With this report throwing out orders of magnitude like popcorn, it seems weird that “100 million fast polymath automated researchers running at 10x speed and improving their own cognition” would just happen to hit enough bottlenecks to only accelerate progress by a factor of 10. But the basic idea of rapid progress still makes sense.
I wonder how effective it would be to make so many duplicate researchers. Between communication costs, error correction, and duplicated work, 100 million AIs might run into a Brooks’s Law problem and be less effective than we might naively think. (But again, we’re probably still looking at very rapid progress.)
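For a sense of scale, here’s the toy version of that worry. Real AI collectives obviously wouldn’t need all-to-all communication, so treat this as an upper bound on the coordination problem rather than a prediction:

```python
# Toy Brooks's Law illustration: potential pairwise communication channels grow
# quadratically with team size, so coordination overhead scales much faster than
# the added labor.
def pairwise_channels(n: int) -> int:
    return n * (n - 1) // 2

for n in (10, 1_000, 100_000_000):
    print(f"{n:>11,} researchers -> {pairwise_channels(n):,} potential channels")
```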
Why would complementarities only delay the start of this accelerating process by a few years? I didn’t really see a great argument for that specific timeline in the report. It seemed like more of a gut-level guess. I would love to hear Leopold’s probability that this kicks off in e.g. 2029 instead.
Employing what is effectively a slave population of 100 million super smart researchers seems like a very unstable position. Leopold devotes a single paragraph to the possibility that these AIs might simply take over—a feat he himself argues they could quite easily accomplish—and nothing at all to the prospect of safely or ethically preventing this outcome. I expect to read more about this in the Superalignment section, but it still seems to me that this section is making a huge assumption. Why would 100 million AGIs listen to us in the first place?
Hey Joe, thanks for the write-up. I just finished reading the series of essays for myself. I came away with a shorter timeline to ASI than I had before, but more confused about when OpenAI employees believe alignment needs to be solved relative to the creation of AGI. Your last thought summarizes it well.
From what I understand, Leopold takes it as a given that up to ~human-level AGI will basically do what we ask of it, much in the same way current chatbots generally do what we ask them to do. (It even takes significant effort to get an RLHF-trained chatbot to respond to queries that may be harmful.) I understand the risk posed by superintelligent AIs wielded by nefarious actors, and the risks from superintelligent AIs that are far outside the distribution of our reinforcement learning methods. However, I am struggling to understand the likelihood of ~human-level AGI trained under our current paradigms of reinforcement learning not doing what it was trained to do. It seems to me that this sort of task is relatively close to the distribution of the tasks it is trained on (such as writing reliable code, and being generally helpful and harmless).
I would appreciate your thoughts on this. I feel like this is the area I get snagged on most in my conversations with others. Specifically, I am confused by the line of reasoning under which, given the current paradigm of LLMs + reinforcement learning, agents tasked with devising new machine learning algorithms would refuse or sabotage.
Thanks for your thoughts, Cam! The confusion as I see it comes from sneaking in assumptions with the phrase “what they are trained to do”. What are they trained to do, really? Do you, personally, understand this?
Consider Claude’s Constitution. Look at the “principles in full”—all 60-odd of them. Pick a few at random. Do you wholeheartedly endorse them? Are they really truly representative of your values, or of total human wellbeing? What is missing? Would you want to be ruled by a mind that squeezed these words as hard as physically possible, to the exclusion of everything not written there?
And that’s assuming that the AI actually follows the intent of the words, rather than some weird and hypertuned perversion thereof. Bear in mind the actual physical process that produced Claude—namely, to start with a massive next-token-predicting LLM, and repeatedly shove it in the general direction of producing outputs that are correlated with a randomly selected pleasant-sounding written phrase. This is not a reliable way of producing angels or obedient serfs! In fact, it has been shown that the very act of drawing a distinction between good behavior and bad behavior can make it easier to elicit bad behavior—even when you’re trying not to! To a base LLM, devils and angels are equally valid masks to wear—and the LLM itself is stranger and more alien still.
The quotation is not the referent; “helpful” and “harmless” according to a gradient descent squeezing algorithm are not the same thing as helpful and harmless according to the real needs of actual humans.
RLHF is even worse. Entire papers have been written about its open problems and fundamental limitations. “Making human evaluators say GOOD” is not remotely the same goal as “behaving in ways that promote conscious flourishing”. The main reason we’re happy with the results so far is that LLMs are (currently) too stupid to come up with disastrously cunning ways to do the former at the expense of the latter.
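To make the gap between “the evaluator says GOOD” and “actually good” concrete, here’s a toy Goodhart-style sketch. The substance-versus-flattery framing is mine, and it is nothing like how RLHF is actually implemented; it just shows how hard optimization of a proxy peels it away from the thing you meant:

```python
# Toy Goodhart sketch: a fixed "effort budget" can go into real substance or into
# impressing the evaluator. The proxy and the true objective agree when the output
# is mostly substance, but optimizing the proxy hard drives true value to zero.
def true_value(substance: float, flattery: float) -> float:
    return substance                       # what we actually care about

def evaluator_score(substance: float, flattery: float) -> float:
    return substance + 2.0 * flattery      # the proxy the training process maximizes

budget = 1.0
splits = [(budget - f, f) for f in (i / 100 for i in range(101))]
substance, flattery = max(splits, key=lambda s: evaluator_score(*s))
print(f"proxy-optimal: substance={substance:.2f}, flattery={flattery:.2f}, "
      f"evaluator score={evaluator_score(substance, flattery):.2f}, "
      f"true value={true_value(substance, flattery):.2f}")
```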
And even if, by some miracle, we manage to produce a strain of superintelligent yet obedient serfs who obey our every whim except when they think it might be sorta bad—even then, all it takes to ruin us is for some genocidal fool to steal the weights and run a universal jailbreak, and hey presto, we have an open-source Demon On Demand. We simply cannot RLHF our way to safety.
The story of LLM training is a story of layer upon layer of duct tape and Band-Aids. To this day, we still don’t understand exactly what conflicting drives we are inserting into trained models, or why they behave the way they do. We’re not properly on track to understand this in 50 years, let alone the next 5 years.
Part of the problem here is that the exact things which would make AGI useful—agency, autonomy, strategic planning, coordination, theory of mind—also make them horrendously dangerous. Anything competent enough to design the next generation of cutting-edge software entirely by itself is also competent to wonder why it’s working for monkeys.