Against “argument from overhang risk”

Epistemic status: I wrote this in August 2023, got some feedback I didn’t manage to incorporate very well, and then never published it. There’s been less discussion of overhang risk recently but I don’t see any reason to keep sitting on it. Still broadly endorsed, though there’s a mention of a “recent” hardware shortage which might be a bit dated.


I think arguments about the risks of overhangs are often unclear about what type of argument is being made. Various types of arguments I’ve seen include:

  1. Pausing is net-harmful in expectation because it would cause an overhang, which [insert further argument here]

  2. Pausing is less helpful than [naive estimate of helpfulness] because it would cause an overhang, which [insert further argument here]

  3. We shouldn’t spend effort attempting to coordinate or enforce a pause, because it will cause [harmful PR effects]

  4. We shouldn’t spend effort attempting to coordinate or enforce a pause, because that effort can be directed toward [different goal], which is more helpful (conditional on proponent’s implicit model of the world)

I think it’s pretty difficult to reason about e.g. (4) without good object-level models of the actual trade-offs involved for each proposed intervention, so I’ve been thinking about (1) and (2) recently. (I haven’t spent much time thinking about (3); while I’ve seen some arguments in that direction, most have been either obvious or trivially wrong, rather than wrong in a non-obvious way.)

I think (1) generally factors into two claims:

  1. Most alignment progress will happen from studying closer-to-superhuman models

  2. The “rebound” from overhangs after a pause will happen so much faster than the continuous progress that would’ve happened without a pause, that you end up behind on net in terms of alignment progress at the same level of capabilities

I want to put aside the first claim and argue that the second is not obvious.

Heuristic

Arguments that overhangs are so bad that they outweigh the benefits of pausing or slowing down are essentially claims that a second-order effect dominates the first-order effect. This is sometimes true, but before you’ve screened off this consideration by examining the object level, I think your prior should be against it.

Progress Begets Progress

I think an argument of the following form is not crazy: “All else equal, you might rather smoothly progress up the curve of capabilities from timestep 0 to timestep 10, gaining one capability point per timestep, than put a lid on progress at 0 and then rapidly gain 10 capability points between timesteps 9 to 10, because that gives you more time to work on (and with) more powerful models”.
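
To make the shape of that argument concrete, here is a toy sketch (my own illustration, using only the made-up numbers from the hypothetical above) comparing how much wall-clock time you get with models at or above a given capability level under the two trajectories:

```python
# Toy illustration of the "smooth vs. pause-then-rebound" hypothetical above.
# All numbers are made up; this is not a model of real capabilities growth.

def time_at_or_above(level, trajectory, dt=0.01):
    """Total wall-clock time (in timesteps, up to t = 10) spent with capability >= level."""
    t, total = 0.0, 0.0
    while t <= 10.0:
        if trajectory(t) >= level:
            total += dt
        t += dt
    return total

smooth = lambda t: t                                      # 1 capability point per timestep
paused = lambda t: 0.0 if t < 9.0 else 10.0 * (t - 9.0)   # flat at 0, then 10 points between t = 9 and t = 10

for level in (2, 5, 8):
    print(f"time with models at >= {level} points: "
          f"smooth {time_at_or_above(level, smooth):.1f}, "
          f"paused {time_at_or_above(level, paused):.1f}")
```

On these made-up numbers, the smooth path gives you roughly ten times as much time with models at any given capability level, which is the intuition the argument leans on.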

But notice that this argument has several assumptions, both explicit and implicit:

  • You know that dangerous capabilities emerge when you have 10 capability points, rather than 9, or 7, or 2.

  • You would, in fact, gain 10 capability points (rather than some smaller number) in the last timestep if you stopped larger training runs.

  • You will learn substantially more valuable things from more powerful models. From another frame: you will learn anything at all that generalizes from the “not dangerous” capabilities regime to “dangerous” capabilities regime.

The last assumption is the most obvious one to question, and it’s the source of significant disagreement between various parties. But I don’t think the claim that overhangs make pausing more-dangerous-than-not survives, even if you accept that assumption for the sake of argument.

First, a pause straightforwardly buys you time in many worlds where counterfactual (no-pause) timelines were shorter than the duration of the pause. The less likely you think it is that we reach ASI in the next n years, the less upside there is to an n-year pause.

Second, all else is not equal, and it seems pretty obvious that if we paused large training runs for some period of time, and then unpaused them, we should not (in expectation) see capabilities quickly and fully “catch up” to where they’d be in the counterfactual no-pause world. What are the relevant inputs for pushing the capabilities frontier?

  • Hardware (GPUs, etc.)

  • Algorithms

  • Supporting infrastructure (everything from “better tooling for scaling large ML training runs” to “having a sufficient concentration of very smart people in one place, working on the same problem”)

Approximately all relevant inputs into pushing the capabilities frontier would see less marginal demand if we instituted a pause. Many of them seem likely to have serial dependency trees in practice, such that large parts of projected overhangs may take nearly as long to traverse after a pause as they would have without a pause.

If we institute a pause, we should expect to see (counterfactually) reduced R&D investment in improving hardware capabilities, reduced investment in scaling hardware production, reduced hardware production, reduced investment in research, reduced investment in supporting infrastructure, and fewer people entering the field.

These are all bottlenecks. If a pause only caused a slowdown by suppressing a single input (e.g. hardware production) while everything else continued at the same speed, then I’d be much less surprised to see a sharp spike in capabilities after the end of a pause (though this depends substantially on which input is suppressed). To me, the recent hardware shortage is very strong evidence that a pause would not create an overhang that eliminates all or nearly all bottlenecks to reaching ASI, and so we should not be surprised by a sharp jump in capabilities once the pause ends.

Also relevant is the actual amount of time you expect it to take to “eat” an overhang. Some researchers seem to be operating on a model where most of our progress on alignment will happen in a relatively short window of time before we reach ASI—maybe a few years at best. While I don’t think this is obviously true, it is much more likely to be true if you believe that we will “learn substantially more valuable things from more powerful models”[1].

If you believe that things will move quickly at the end, then an overhang seems most harmful if eating it takes substantially less time than the counterfactual “slow takeoff” period would have. If there’s a floor on how long it takes to eat the overhang, and that floor is comparable to or longer than the counterfactual “slow takeoff” period, then you don’t lose any of the precious wall-clock time working with progressively more powerful models that you need to successfully scale alignment. But given that you already believe the counterfactual “slow takeoff” period is not actually that slow (maybe on the order of a couple of years), you’d need to think that the overhang will get eaten very quickly (on the order of months, or maybe a year on the outside) for you to be losing much time. As argued above, I don’t think we’re very likely to be able to eat any meaningful overhang that quickly.
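
As a toy way of framing that comparison (my own sketch, with purely illustrative numbers): the wall-clock time you lose on this axis is roughly the gap between the counterfactual slow-takeoff duration and the time it takes to eat the overhang, floored at zero.

```python
# Toy framing of the paragraph above; the specific year values are illustrative only.

def time_lost_to_overhang(slow_takeoff_years, overhang_eat_years):
    """Years of work-with-near-ASI-models lost relative to the no-pause world (floored at 0)."""
    return max(0.0, slow_takeoff_years - overhang_eat_years)

print(time_lost_to_overhang(2.0, 0.25))  # 1.75 -- overhang eaten in months: most of the window is lost
print(time_lost_to_overhang(2.0, 2.0))   # 0.0  -- rebound no faster than the counterfactual takeoff: nothing lost
```

(This deliberately ignores the calendar time the pause itself buys, which was the first point above.)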

I haven’t spent much time thinking about situations where things move relatively slowly at the end, but I have a few main guesses for what world models might generate that belief:

  1. Sharply diminishing returns to intelligence

  2. No meaningful discontinuities in “output” from increased intelligence

  3. Incremental alignment of increasingly powerful models allows us to very strongly manage (and slow down) our climb up the capabilities curve

(1) seems implausible to me, but also suggests that risk is not that high in an absolute sense. If you believe this, I’m not sure why you’re particularly concerned about x-risk from AI.

(2) seems contradicted by basically all the empirical evidence we have available to us on the wildly discontinuous returns to intelligence even within the relatively narrow human span of intelligence; I don’t really know how to bridge this gap.

(3) doesn’t seem like it’d work unless you’re imagining dramatically more success at global governance/coordination than I am (and it also makes a lot of assumptions about future alignment successes).

Putting those objections aside, someone who holds (1) doesn’t seem like they should be particularly concerned with overhang risk (though they shouldn’t see much benefit to pausing either). Holders of (2) and (3) do seem like they might be concerned by overhang risk; (3) in particular depends strongly on careful management of capabilities growth.

Conclusion

To sum up, my understanding of the arguments against pausing suggests that they depend on an upstream belief that having enough “well-managed” wall clock time with progressively more powerful models is an important or necessary factor in succeeding at aligning ASI at the end of the day.

I argue that overhang is unlikely to be eaten so quickly that you lose time compared to how much time you’d otherwise have during a slow takeoff. I assert, without argument, that a “very very slow / no takeoff” world is implausible.

I may be misunderstanding or missing the best arguments for why overhangs should be first-order considerations when evaluating pause proposals. If I have, please do leave a comment.


Thanks to Drake Thomas for the substantial feedback.

  1. ^

    Remember that this belief suggests correspondingly more pessimistic views about the value of pausing. There may be arguments against pausing that don’t rely on this, but in practice, I observe that those concerned by AI x-risk who think that pausing is harmful or not very helpful on net tend to arrive at that belief because it’s strongly entailed by believing that we’d be bottlenecked on alignment progress without being able to work with progressively more powerful models.