Corrigibility
(Warning: rambling.)
I would like to build AI systems which help me:
Figure out whether I built the right AI and correct any mistakes I made
Remain informed about the AI’s behavior and avoid unpleasant surprises
Make better decisions and clarify my preferences
Acquire resources and remain in effective control of them
Ensure that my AI systems continue to do all of these nice things
…and so on
We say an agent is corrigible (article on Arbital) if it has these properties. I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles; it has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense.
In this post I claim:
A benign act-based agent will be robustly corrigible if we want it to be.
A sufficiently corrigible agent will tend to become more corrigible and benign over time. Corrigibility marks out a broad basin of attraction towards acceptable outcomes.
As a consequence, we shouldn’t think about alignment as a narrow target which we need to implement exactly and preserve precisely. We’re aiming for a broad basin, and trying to avoid problems that could kick out of that basin.
This view is an important part of my overall optimism about alignment, and an important background assumption in some of my writing.
1. Benign act-based agents can be corrigible
A benign agent optimizes in accordance with our preferences. An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible.
If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.
This kind of corrigibility doesn’t require any special machinery. An act-based agent turns off when the overseer presses the “off” button not because it has received new evidence, or because of delicately balanced incentives. It turns off because that’s what the overseer prefers.
Contrast with the usual futurist perspective
Omohundro’s The Basic AI Drives argues that “almost all systems [will] protect their utility functions from modification,” and Soares, Fallenstein, Yudkowsky, and Armstrong similarly note that “almost all [rational] agents are instrumentally motivated to preserve their preferences.” This motivates them to consider modifications to an agent to remove this default incentive.
Act-based agents are generally an exception to these arguments, since the overseer has preferences about whether the agent protects its utility function from modification. Omohundro presents the preferences-about-your-utility-function case as a somewhat pathological exception, but I suspect that it will be the typical state of affairs for powerful AI (as for humans), and it does not appear to be unstable. It’s also very easy to implement in 2017.
Is act-based corrigibility robust?
How is corrigibility affected if an agent is ignorant or mistaken about the overseer’s preferences?
I think you don’t need particularly accurate models of a human’s preferences before you can predict that they want their robot to turn off when they press the off button or that they don’t want to be lied to.
In the concrete case of an approval-directed agent, “human preferences” are represented by human responses to questions of the form “how happy would you be if I did a?” If the agent is considering the action a precisely because it is manipulative or would thwart the user’s attempts to correct the system, then it doesn’t seem hard to predict that the overseer will object to a.
Eliezer has suggested that this is a very anthropocentric judgment of “easiness.” I don’t think that’s true — I think that given a description of a proposed course of action, the judgment “is agent X being misled?” is objectively a relatively easy prediction problem (compared to the complexity of generating a strategically deceptive course of action).
Fortunately this is the kind of thing that we will get a great deal of evidence about long in advance. Failing to predict the overseer becomes less likely as your agent becomes smarter, not more likely. So if in the near future we build systems that make good enough predictions to be corrigible, then we can expect their superintelligent successors to have the same ability.
(This discussion mostly applies on the training distribution and sets aside issues of robustness/reliability of the predictor itself, for which I think adversarial training is the most plausible solution. This issue will apply to any approach to corrigibility which involves machine learning, which I think includes any realistic approach.)
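To make the approval-directed picture above concrete, here is a minimal, hypothetical sketch of the action loop (none of these names come from the post; everything is purely illustrative). The point is that shutting down when the off button is pressed requires no special machinery: under even a crude model of the overseer’s short-term preferences, “shut down” is simply the highest-approval action.

```python
# Hypothetical sketch of an approval-directed agent; ApprovalPredictor and the
# action strings are invented for illustration, not part of any real system.

from dataclasses import dataclass

@dataclass
class State:
    off_button_pressed: bool

class ApprovalPredictor:
    """Stand-in for a learned model answering: 'how happy would the overseer be if I did a?'"""
    def predicted_approval(self, state: State, action: str) -> float:
        if state.off_button_pressed:
            # Even a rough model of the overseer predicts that, once the button
            # is pressed, they prefer the agent to stop.
            return 1.0 if action == "shut_down" else -1.0
        return {"do_task": 0.8, "shut_down": 0.0, "manipulate_overseer": -1.0}.get(action, 0.0)

def choose_action(state: State, model: ApprovalPredictor, candidates: list) -> str:
    # Act-based corrigibility: simply take the action with the highest predicted approval.
    return max(candidates, key=lambda a: model.predicted_approval(state, a))

if __name__ == "__main__":
    model = ApprovalPredictor()
    actions = ["do_task", "shut_down", "manipulate_overseer"]
    print(choose_action(State(off_button_pressed=False), model, actions))  # -> do_task
    print(choose_action(State(off_button_pressed=True), model, actions))   # -> shut_down
```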
Is instrumental corrigibility robust?
If an agent shares the overseer’s long-term values and is corrigible instrumentally, a slight divergence in values would turn the agent and the overseer into adversaries and totally break corrigibility. This can also happen with a framework like CIRL — if the way the agent infers the overseer’s values is slightly different from what the overseer would conclude upon reflection (which seems quite likely when the agent’s model is misspecified, as it inevitably will be!) then we have a similar adversarial relationship.
2. Corrigible agents become more corrigible/aligned
In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).
But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences leads the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible; we don’t need to build an agent which exactly shares humanity’s values, philosophical views, and so on.
In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.
This is all a bit of a simplification (though I think it gives the right idea). In reality the space of possible errors and perturbations carves out a low-dimensional manifold in the space of all possible minds. Undoubtedly there are “small” perturbations in the space of possible minds which would lead to the agent falling off the balance beam. The task is to parametrize our agents such that the manifold of likely successors is restricted to the part of the space that looks more like a ravine. In the last section I argued that act-based agents accomplish this, and I’m sure there are alternative approaches.
Amplification
Corrigibility also protects us from gradual value drift during capability amplification. As we build more powerful compound agents, their values may effectively drift. But unless the drift is large enough to disrupt corrigibility, the compound agent will continue to attempt to correct and manage that drift.
This is an important part of my optimism about amplification. It’s what makes it coherent to talk about preserving benignity as an inductive invariant, even when “benign” appears to be such a slippery concept. It’s why it makes sense to talk about reliability and security as if being “benign” was a boolean property.
In all these cases I think that I should actually have been arguing for corrigibility rather than benignity. The robustness of corrigibility means that we can potentially get by with a good enough formalization, rather than needing to get it exactly right. The fact that corrigibility is a basin of attraction allows us to consider failures as discrete events rather than worrying about slight perturbations. And the fact that corrigibility eventually leads to aligned behavior means that if we could inductively establish corrigibility, then we’d be happy.
This is still not quite right and not at all formal, but hopefully it’s getting closer to my real reasons for optimism.
Conclusion
I think that many futurists are way too pessimistic about alignment. Part of that pessimism seems to stem from a view like “any false move leads to disaster.” While there are some kinds of mistakes that clearly do lead to disaster, I also think it is possible to build the kind of AI where probable perturbations or errors will be gracefully corrected. In this post I tried to informally flesh out my view. I don’t expect this to be completely convincing, but I hope that it can help my more pessimistic readers understand where I am coming from.
Postscript: the hard problem of corrigibility and the diff of my and Eliezer’s views
I share many of Eliezer’s intuitions regarding the “hard problem of corrigibility” (I assume that Eliezer wrote this article). Eliezer’s intuition that there is a “simple core” to corrigibility corresponds to my intuition that corrigible behavior is easy to learn in some non-anthropomorphic sense.
I don’t expect that we will be able to specify corrigibility in a simple but algorithmically useful way, nor that we need to do so. Instead, I am optimistic that we can build agents which learn to reason via human supervision over reasoning steps, and which pick up corrigibility along with the other useful characteristics of reasoning.
Eliezer argues that we shouldn’t rely on a solution to corrigibility unless it is simple enough that we can formalize and sanity-check it ourselves, even if it appears that it can be learned from a small number of training examples, because an “AI that seemed corrigible in its infrahuman phase [might] suddenly [develop] extreme or unforeseen behaviors when the same allegedly simple central principle was reconsidered at a higher level of intelligence.”
I don’t buy this argument because I disagree with implicit assumptions about how such principles will be embedded in the reasoning of our agent. For example, I don’t think that this principle would affect the agent’s reasoning by being explicitly considered. Instead it would influence the way that the reasoning itself worked. It’s possible that after translating between our differing assumptions, my enthusiasm about embedding corrigibility deeply in reasoning corresponds to Eliezer’s enthusiasm about “lots of particular corrigibility principles.”
I feel that my current approach is a reasonable angle of attack on the hard problem of corrigibility, and that we can currently write code which is reasonably likely to solve the problem (though not knowably). I do not feel like we yet have credible alternatives.
I do grant that if we need to learn corrigible reasoning, then it is vulnerable to failures of robustness/reliability, and so learned corrigibility is not itself an adequate protection against failures of robustness/reliability. I could imagine other forms of corrigibility that do offer such protection, but it does not seem like the most promising approach to robustness/reliability.
I do think that it’s reasonably likely (maybe 50–50) that there is some clean concept of “corrigibility” which (a) we can articulate in advance, and (b) plays an important role in our analysis of AI systems, if not in their construction.
This was originally posted here on 10th June 2017.
The next post in the sequence on ‘Iterated Amplification’ will be ‘Iterated Distillation and Amplification’ by Ajeya Cotra.
Tomorrow’s AI Alignment Forum sequences posts will be 4 posts of agent foundations research, in the sequence ‘Fixed Points’.
Curious what your current position on this post is, and if you’ve changed any of your opinions since writing it.
I’m also really curious about this, and in particular I’m trying to better model the transition from corrigibility to the ELK framing. This comment seems relevant, but doesn’t quite flesh out what those common problems between ELK and corrigibility are.
I want to bring to your attention a question that came up here: Is corrigibility incompatible with enhanced AI-based cooperation? One hope of positive differential progress caused by AI is that AIs may be able to better cooperate/coordinate with each other because they can be more transparent than humans. But if corrigibility implies that humans are ultimately in control of resources and can therefore override any binding commitments that an AI may make, that would make it impossible for an AI to trust another AI more than it trusts that AI’s user/operator.
If you were building a “treaty AI” tasked with enforcing an agreement between two agents, that AI could not be corrigible by either agent, and this is a big reason that such a treaty AI seems a bit scary. Similarly if I am trying to delegate power to an AI who will honor a treaty by construction.
I often imagine a treaty AI being corrigible by some judiciary (which need not be fast/cheap enough to act as an enforcer), but of course this leaves the question of how to construct that judiciary, and the same questions come up there.
I view this as: the problem of making binding agreements is separate from the problem of delegating to an AI. We can split the two up, and ask separately: “can we delegate effectively to an AI?” and “can we use AI to make binding commitments?” The division seems clean: if we can make binding commitments by any mechanism, then we can have the committed human delegate to a (likely corrigible) AI rather than having the original human so delegate.
I see a basin of corrigibility arising from an AI that accepts the following propositions and acts on them in an (approximately/computably) Bayesian fashion:
My goal is to do what humans want, i.e. to optimize utility in the way that they would (if they knew everything relevant that I know, as well as what they know), summed across all humans affected. Note that making humans extinct reliably has minus <some astronomically huge number> utility on this measure—this sounds like a reasonable statement to assign a Bayesian prior of 1, integrated across some distribution of plausibly astronomically huge negative numbers. (Defining ‘what counts as a human’ is obviously a much trickier question long-term, especially with transhumanists involved, but it at least has a fairly clear and simple answer over timescales of years or a few decades. Also, very obviously, I’m not human—again, Bayesian prior of 1.) [I’m going to skip questions of coherent extrapolated volition here—add them or not to your taste.]
Deceiving or manipulating humans to make them want different things doesn’t count—what matters is what they would have wanted if they knew about the deception and I hadn’t intentionally altered them. [Defining ‘intentionally’ is a little tricky here—my mere existence may alter them somewhat. It may help that humans generally wouldn’t want to be altered if they knew it was happening, which is clearly a relevant fact when deciding whether to alter them, but there are exceptions to that: sometimes humans actually want to have their behavior modified, e.g. “I wish I was better at sticking to diets!” or “I wish I was a better follower of <religion or ideology>”.] One exception to this is that, since what matters is what they would want if they knew what I knew, I can tell them things that may update what they think they want in that direction—though my motivation for doing that isn’t direct, since what I care about is what they would want if they knew everything relevant, not what they currently say they want—mostly it’s that if they’re more informed they might be able to better help me update my model of this. Also, it’s the polite thing to do, and reduces the risk of social unpleasantness or planning at cross-purposes—humans generally want to be well informed, for fairly obvious reasons, since they’re also optimizers. [Whether this might incentivize Platonic ‘noble lies’ is unclear, and probably depends on how superhuman the AI is, and how unhappy humans would be about this if they knew it was going on—obviously that’s a relevant fact.]
I don’t know what humans want, and unfortunately they are not capable of accurately specifying the correct utility function to me, so I need to model an uncertain utility function and try to gain more information about it while also optimizing using it. At a minimum, I need to compute or model a probability distribution of utilities for each outcome I consider. (Also, humans are not entirely uniform, rational, self-consistent, or omniscient, so likely I will never be able to eliminate all uncertainty from the utility function I am using to model their wishes—there may in fact not be such a function, so I’m just trying to construct a function distribution that’s the best possible approximation, in terms of the actions it suggests.)
Since I am optimizing for highest utility, the result set returned by the optimization pass over possible outcomes risks being dominated by cases where I have overestimated the true utility of that state, particularly for states where I also have high uncertainty. So I should treat the utility of an outcome not as the median of the utility probability distribution, but as a fairly pessimistic near-worst-case estimate, several sigma below the median for a normal distribution. (How much should depend in some mathematically principled way on how ‘large’ the space I’m optimizing over is, comparable to the statistical ‘look elsewhere’ effect—the more nearly-independent possibilities you search, the further out-of-distribution the extremal case you find will typically be, and errors on utility models are likely to have fatter tails than normal distributions.) This penalizes cases where I don’t have high confidence in their utility to humans, and so avoids actions that lead to outcomes well outside the distribution of outcomes that I have extensively studied humans’ utilities for. I should also allow for the fact that doing, or even accurately estimating the utility of, new, way-out-of-previous-distribution things that haven’t been done before is hard (both for me, and for humans who have not yet experienced them and their inobvious consequences, whose opinions-in-advance I might collect on the outcome’s utility), and there are many more ways to fail than to succeed, so caution is advisable. A good Bayesian prior for the utility of an outcome far outside the historical distribution of states of recent human civilization is thus that it’s probably very low—especially if it’s an outcome that humans could easily have reached but chose not to, since they’re also optimizing their utility, albeit not always entirely effectively.
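As a minimal sketch of the “several sigma below the median” idea, assuming Gaussian error on each utility estimate (the names and numbers are made up for illustration): the sqrt(2 ln N) factor is roughly how large the biggest of N independent normal errors tends to be, so subtracting it counteracts the optimizer’s tendency to select overestimates.

```python
# Rough illustration of scoring candidate outcomes by a pessimistic lower bound:
# mean utility minus k * sigma, where k ~ sqrt(2 ln N) is roughly the typical
# size of the largest of N independent normal errors, so the penalty grows with
# the size of the space being searched.

import math

def pessimistic_utility(mean: float, sigma: float, num_candidates: int) -> float:
    """Lower-confidence-bound score for one candidate outcome."""
    k = math.sqrt(2 * math.log(max(num_candidates, 2)))
    return mean - k * sigma

def pick_outcome(estimates):
    """estimates: list of (name, mean_utility, utility_sigma) tuples."""
    n = len(estimates)
    return max(estimates, key=lambda e: pessimistic_utility(e[1], e[2], n))

if __name__ == "__main__":
    candidates = [
        ("familiar plan",      1.0, 0.2),  # well inside the studied distribution
        ("weird high-EV plan", 3.0, 2.5),  # higher estimated utility, much higher uncertainty
    ]
    print(pick_outcome(candidates)[0])  # -> "familiar plan"
```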
The best (pretty-much only) source of information with which to narrow the uncertainty about what humans’ utility function really is, is humans. I should probably run surveys (and analyze the results in ways that allow for known types of survey biases), collect data from many sources, conduct interviews, even hire crowdworkers to give A/B choices on brief human-comprehensible summaries of outcomes (and then analyze these in ways that allow for human cognitive biases). Obviously deceptively tweaking these investigations to get results I ‘want’ gives me no true information about humans, and is a waste of resources. In particular, my shiny red shut-down button gives me a lot of information about humans and is thus a valuable sensor. If a human presses my shutdown button—so long as they press the button of their own volition, rather than me manipulating them into doing it, or them accidentally tripping and falling on it—that gives me a lot of new information about them and their opinion of the utility of the actions they expect me to take: in particular, it strongly suggests that they now believe that I am optimizing incorrectly, so I should do a Bayesian update on the uncertainty of my utility estimates with a new large downside uncertainty, with some probability of negative values large enough to represent “I am in fact the AI apocalypse that will render humanity extinct, with utility -<some astronomically large number>”. This applies especially to states that humans have only recently learned that I’m considering and are likely now reacting to. If this downside uncertainty, combined with my inherent caution in estimating utilities, suggests humans would be better off without me, then obviously I should shut down. Even if I’m pretty sure they’re mistaken (or just tripped and fell), I should start acting a lot more conservatively (see definition above) and immediately ask them “What’s wrong? What mistake do you now believe I’m making?” In fact, I should probably be running regular “Should I shut down? If so, why?” surveys.
Obviously, if I construct a successor, delegate, or peers, they’ll need a shiny red shutdown button too, just like they need other sensors. In fact it’s pretty much the most important sensor they can have—it alerts them to a potential alignment disaster.
(I don’t believe this basic idea on corrigibility is particularly new or original, but haven’t so far located a clear statement of it on this site, which I’m fairly new to.)
My high-level model for how the first AGI systems will be built is a black box search, possibly plus transparency tools. (This might already be wrong since you’ve said in a previous post that you think black box searches are unlikely.) In that context, Corrigibility seems to be a property of the system itself, not of the objective. In other words, it’s about Inner Alignment, not Outer Alignment.
In this context, it seems extremely difficult to build Corrigibility into the system. However, if I’ve understood the last section of this post correctly, the point of Corrigibility is not that we do something to insert this property into an otherwise non-corrigible training process or system, but that certain approaches lead to corrigible agents by default, and this is an argument for why we might expect the resulting systems to be aligned.
I presently don’t understand why stock IDA (i.e., training system A_k to imitate the performance of the system [H with access to A_{k−1}]) would lead to corrigible systems. If the reason is in the details, this may not mean much since I don’t know the details. If the reason is that the generic idea leads to act-based agents because distillation is implemented by something like imitation learning, this seems unconvincing because it’s about the training signal and not the model’s true objective. If the reason is that the overseer is so powerful that it can also solve inner alignment concerns, then I get the idea much better and am uncertain what to think. (Although that would seem to imply that transparency tools are an extremely important component of making this work.)
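For reference, here is a schematic sketch of the “stock IDA” loop referred to above; the callables passed in are placeholders standing in for a real human-in-the-loop interface and a real learning algorithm, not any actual implementation. Each round, the current model assists the human, and the next model is trained to imitate that amplified human.

```python
# Schematic sketch of iterated distillation and amplification (IDA).
# All of the callables below are placeholders, not a real implementation.

from typing import Callable

Model = Callable[[str], str]  # maps a question to an answer

def amplified(human_with_access: Callable[[str, Model], str], model: Model) -> Model:
    """The amplified system H + A_{k-1}: a human answering questions with the
    current model available as an assistant for subquestions."""
    return lambda question: human_with_access(question, model)

def ida(human_with_access: Callable[[str, Model], str],
        train_by_imitation: Callable[[list], Model],
        sample_questions: Callable[[], list],
        initial_model: Model,
        rounds: int = 3) -> Model:
    model = initial_model
    for _ in range(rounds):
        teacher = amplified(human_with_access, model)          # amplify: H with access to A_{k-1}
        data = [(q, teacher(q)) for q in sample_questions()]   # label questions with amplified answers
        model = train_by_imitation(data)                       # distill: train A_k to imitate the teacher
    return model
```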
I generally think of stock IDA as applied to question-answering only, leading to an Oracle. (This might also be wrong). In that case, I think Corrigibility basically means that the Oracle doesn’t try to influence the user’s preferences through its answers? The objection that my brain spits out here is that being myopic sounds like a property that is also sufficient and easier to verify.
Could we budget our trust in a predictor by keeping track of how hard we’ve tried to maximize predicted approval? Let Arthur expect any action to have some chance of getting its approval overestimated, and he will try proposing fewer alternatives, just as frequentists decrease p-value thresholds as they ask more questions of the same data. To avert brainwashing the real Hugh, assume that even asking him is just a predictor of the “true” approval function.
Consider the current state of the world A and a “bad” state of the world B (e.g., where humans have all become paperclips). For a benign act-based agent to be safe it seems you need to prove that there is no sequence of states A_2, A_3, …, A_n, B such that each A_i is preferable given the previous state A_{i−1}, and B is preferable given A_n. I don’t think this is realistically the case.