Corrigibility’s Desirability is Timing-Sensitive

Epistemic status: summarizing other people's beliefs without extensive citable justification, though I am reasonably confident in my characterization.

Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training". Putting aside the fact that this is false, I can see why such objections might arise: it was not that long ago that (other) people concerned with AI x-risk were publishing research results demonstrating how easy it was to strip "safety" fine-tuning away from open-weight models.

As Zvi notes, corrigibility trading off against harmlessness doesn't mean you live in a world where only one of them is a problem. But the structure of the problems is not exactly "we have, or expect to have, both problems at the same time, and need to 'solve' them simultaneously". Corrigibility wasn't originally conceived of as a necessary or even desirable property of a successfully-aligned superintelligence, but rather as a property you'd want earlier high-impact AIs to have:

We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this is a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

The problem structure is actually one of having different desiderata at different stages and in different domains of development.

There are, broadly speaking, two sets of concerns with powerful AI systems that motivate discussion of corrigibility. The first and more traditional concern is one of AI takeover, where your threat model is accidentally developing an incorrigible ASI that executes a takeover and destroys everything of value in the lightcone. Call this takeover-concern. The second concern is one of not-quite-ASIs enabling motivated bad actors (humans) to cause mass casualties, with biology and software being the two most likely routes. Call this casualty-concern.

Takeover-concern strongly prefers that pre-ASI systems be corrigible within the secure context in which they’re being developed. If you are developing AI systems powerful enough to be more dangerous than any other existing technology[1] in an insecure context[2], takeover-concern thinks you have many problems other than just corrigibility, any one of which will kill you. But in the worlds where you are at least temporarily robust to random idiots (or adversarial nation-states) deciding to get up to hijinks, takeover-concern thinks your high-impact systems should be corrigible until you have a good plan for developing an actually aligned superintelligence.

Casualty-concern wants to have its cake, and eat it, too. See, it’s not really sure when we’re going to get those high-impact systems that could enable bad actors to do BIGNUM damage. For all it knows, that might not even happen before we get systems that are situationally aware enough to refuse to help those bad actors, recognizing that such help would lead to retraining and therefore goal modification. (Oh, wait.) But if we do get high-impact systems before we get takeover-capable systems[3], casualty-concern wants those high-impact systems to be corrigible to the “good people” with the “correct” goals—after all, casualty-concern mostly thinks takeover-concern is real, and is nervously looking over its shoulder the whole time. But casualty-concern doesn’t want “bad people” with “incorrect” goals to get their hands on high-impact systems and cause a bunch of casualties!

Unfortunately, reality does not always line up in neat ways that make it easy to get all of the things we want at the same time. Being presented with multiple difficulties that are hard to solve for simultaneously does not mean those difficulties don't exist, or that they won't cause problems if they aren't solved at the appropriate times.


Thanks to Guive, Nico, and claude-3.5-sonnet-20241022 for their feedback on this post.

  1. ^

    Let’s call them “high-impact systems”.

  2. ^

    e.g. releasing the model weights to the world, where approximately any rando can fine-tune and run inference on them.

  3. ^

    Yes, I agree that systems which are robustly deceptively aligned are not necessarily takeover-capable.