Take 14: Corrigibility isn’t that great.
As a writing exercise, I’m writing an AI Alignment Hot Take Advent Calendar—one new hot take, written some days, for 25 days.
It’s the end (I saved a tenuous one for ya’)! Kind of disappointing that this ended up averaging out to one every 2 days, but this was also a lot of work and I’m happy with the quality level. Some of the drafts that didn’t work as “hot takes” will get published later.
I
There are certainly arguments for why we want to build corrigible AI. For example, the problem of fully updated deference says that if you build an AI that wants things, even if it’s uncertain about what it wants, it knows it can get more of what it wants if it doesn’t let you turn it off.
The mental image this conjures up is of an AI doing something that’s obvious-to-humans bad, with us clamoring to stop it while it blocks us from turning it off because we didn’t solve the problem of fully updated deference. It would be better if we built an AI that took things slow, and that would let us shut it off once we looked at what it was doing and saw that it was obviously bad.
Don’t get me wrong, this could be a nice property to have. But I don’t think it’s all that likely to come up, because aiming at aligned AI means building AI that tries not to do obviously bad stuff.
A key point is that corrigibility is only desirable if you actually expect to use it. Its primary sales pitch is that it might give us a mulligan on an AI that starts doing obviously bad stuff. If everything goes great and we wind up in a post-scarcity utopia, I’m not worried about whether the AI would let me turn it off if I counterfactually wanted to.
A world where corrigibility is useful might look like us building an agenty AI with a value learning process that we’re not confident in, letting it run and interacting with it to try to judge how the value learning is going, and then (with moderate probability) turning it off and trying again with another idea for value learning. What does corrigibility have to do in this world? The AI shouldn’t deliberately try to get shut down by doing obviously-bad things, but it also shouldn’t try to avoid being shut down by instrumentally hiding bad behavior, or by backing itself up on AWS.
Such indifference to the outside world is the default for limited AI that doesn’t model that part of the world, or doesn’t make decisions in a very coherent way. But in an agent that’s good at navigating the real world, a lot of corrigibility is made out of value learning. The AI probably has to actively notice when it’s coming into conflict with humans (and specifically humans, rather than head lice) and defer to them, even if those humans want to shut down the AI or rewrite its value learning process.
So the first issue: if you can already do things like notice when you’re coming into conflict with humans, I fully expect you can build an AI that tries not to do things the humans think are obviously bad. And even though this has its dangers, notably that an AI which avoids doing obviously-bad things makes corrigibility less likely to ever be used, what the hell are you doing value learning for if you’re not going to use it to get the AI to do good things and not bad things?
II
Second issue: sometimes agenty properties are good. An incorrigible AI is one that endorses some value learning process or meta-process, and will defend that good process against random noise, and against humans who might try to modify the process selfishly or short-sightedly.
The point of corrigibility is that the AI should not trust its own judgement about what counts as “short-sighted” for the human, and should let itself be shut down or modified. But sometimes humans are like a toddler in a self-driving car, and you don’t want the car to listen when they press the emergency stop button. And more vaguely, I don’t want corrigibility’s unnaturalness to leak out and interfere with a super-powerful AI protecting what it finds good.
Maybe there’s a fine line we can tread here, where some parameter for how strongly the AI protects its goals changes as we gain trust in its reasoning process. But it seems plausible that corrigibility creates more problems than it solves in a future where we’re pretty confident in the value learning process.
I’m not saying we can’t test things. If we want to test an AI’s value learning process without having problems with creating an adversarial agent, then the safest way is to not create an agent at all—just directly test a generative world-model, or a plan generator that’s not hooked up to anything, or what have you. In many ways, this is corrigibility, just an extreme form that makes the AI useless for deployment.
When we actually build any superintelligent agent, I’d rather that we just have a value-learning process that we trust. One that not only doesn’t do obviously bad things, but goes so far as to not do obviously bad meta-level reasoning either. It’s been speculated that a superintelligent AI would reinvent corrigibility so it could give it to its successor AIs. I bet a superintelligent AI would just solve value learning instead.
Corrigibility isn’t incompatible with usually refusing to shut down. It’s the opposite of wrapper-mindedness, not the opposite of agency. The kind of agent that’s good at escalating concerns about its fundamental optimization tendencies can still be corrigible. A more capable corrigible agent won’t shut down; it’d fix itself instead (with shutting down being a weird special case of fixing itself). A less capable corrigible agent has to shut down for maintenance by others.
Strawberry alignment does want shutdown as a basic building block. In the absence of a solution to ambitious alignment (channeling correct preferences-on-reflection), corrigible non-agents have utility for managing the acute AI risk period. A strawberry-aligned AI also shouldn’t pursue the subtask of fixing problems in its own cognition, other than by opening itself up for maintenance, since that requires more dangerous cognition than whatever object-level problem was posed to it. (Even an agent that’s corrigible might recognize the absence of a solution to ambitious alignment as a reason to shut down for good, since nobody knows how to fix it. But that’s less clear. A narrower goodhart scope for a task might also turn agency into non-agency in the strawberry alignment sense.)
So I think the argument in this post is more against pursuit of strawberry alignment (non-agency where corrigibility involves shutdown), than against pursuit of corrigibility, though it doesn’t fit either framing very well.
It’s unclear how much of what you’re describing is “corrigibility,” and how much of it is just being good at value learning. I totally agree that an agent that has a sophisticated model of its own limitations, and is doing good reasoning that is somewhat corrigibility-flavored, might want humans to edit it while it’s not very good at understanding the world, but will then quickly decide that being edited is suboptimal once it’s better than humans at understanding the world.
But this sort of sophisticated-value-learning reasoning doesn’t help you if the AI is still flawed once it’s better than humans at understanding the world. That’s why I file it under “just be good at value learning rather than bad at it” rather than under “corrigibility.” If you want guarantees about being able to shut down an AI, it’s no help to you if those guarantees hold only when the AI is already doing a good job of sophisticated value-learning reasoning—I usually interpret corrigibility discussion as intended to give safety guarantees that help you even when alignment guarantees fail.
It’s like the humans want to have a safeword, where when the humans are serious enough about wanting the AI to shut down to use the safeword, the AI does it, even if it thinks that it knows better than the humans and the humans are making a horrible mistake.
Corrigibility is a tendency to fix fundamental problems based on external observations, before the problems lead to catastrophes. It’s less interesting when applied to things other than preference, but even when applied to preference it’s not just value learning.
There’s value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and there’s value learning as a form of preference. I think humans might lack fixed values appropriate for the first sense (even normatively, on reflection of the kind feasible in the physical world). Values that are themselves corrigible can’t be fully learned; otherwise the resulting agent won’t be aligned in the ambitious sense, since its preference won’t be the same kind of thing as the corrigible human preference-on-reflection. The values of such an aligned agent must remain corrigible indefinitely.
I think being good at corrigibility (in the general sense, not about values in particular) is orthogonal to being good at value learning; it’s about recognizing one’s own limitations, including the limitations of value learning and of corrigibility itself. So acting only within the goodhart scope (situations where good proxies of value and other tools of good decision making are already available) is a central example of corrigibility, as is shutting down activities in situations well outside the goodhart scope. So is not pushing the world outside your goodhart scope with your own actions (before the scope has already extended there, through sufficient value learning and reflection). Corrigibility makes the agent wait on value learning and other relevant safety guarantees rather than race with them, so a corrigible agent being bad at value learning (or not knowably good enough at corrigibility) merely makes it less capable, until it improves.
For an agent, shutting down could mean shutting down more impactful/weird external actions (those further from goodhart scope), but continuing to act in more familiar situations and continuing to work on value learning and other changes in its cognition that it considers safe to work on. The share of work done by humans vs. a corrigible agent depends on who is better at safely fixing a given problem/limitation, not on whether the agent is better at understanding the world in a loose sense. Humans are not safe.