Double Corrigibility: better Corrigibility
A putative new idea for AI control; index here.
Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI’s utility changes from u to v at time t, it maximises a meta-utility U such that
U≤t = u
U>t = v + E(u|u→u) - E(v|u→v).
Here u→u designates the event that u remains unchanged, while u→v designates the change.
As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change—it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
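To make the construction concrete, here is a minimal sketch in Python, assuming a toy model where utilities are functions of outcomes and the estimator E is stubbed out. The names `meta_utility` and the call shape of `E` are illustrative assumptions, not part of the original formalism.

```python
# Minimal sketch of utility indifference via compensatory rewards.
# The helper name and the stub estimator interface are assumptions
# for illustration, not part of the original proposal.

def meta_utility(u, v, t_change, E):
    """Build the meta-utility U for a utility change u -> v at time t_change.

    Before the change, U agrees with u. After it, U is v plus the
    compensatory reward E(u | u->u) - E(v | u->v), a constant fixed at
    the moment of change; that constant is what makes the AI indifferent
    to whether the change happens.
    """
    compensation = E(u, "u->u") - E(v, "u->v")

    def U(outcome, t):
        if t <= t_change:
            return u(outcome)
        return v(outcome) + compensation

    return U
```

With a paperclip utility u and a staple utility v, the pre-change AI scores outcomes by u alone, while the post-change AI scores them by v plus the fixed compensation term.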
What it is not indifferent to, though, is changes to E. If the AI knows that change u→v is coming, then it has incentives to change E to E’ with a high E’(u|u→u) and a low E’(v|u→v). Its ability to rewire is limited by the fact that E’ will serve as its future expectation estimator, so it won’t mangle it completely, but it certainly can twist it a bit.
Notice that the AI is now no longer indifferent to the change of utility: it can benefit from rewiring E this way only if a change is coming up. Notice also that E’ probably can’t be profitably rewired again, so the rewired AI will be indifferent; the AI before rewiring is not.
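The exploit above can be shown on toy numbers (all of them assumed for illustration). An honest estimator gives equal estimates on both sides, so the compensatory reward is zero; a slightly twisted estimator inflates E’(u|u→u) and deflates E’(v|u→v), manufacturing a positive reward out of the coming change.

```python
# Toy illustration (all numbers assumed) of the incentive to rewire E.

def compensatory_reward(estimates):
    # The reward E(u | u->u) - E(v | u->v) paid out at the utility change.
    return estimates["u|u->u"] - estimates["v|u->v"]

honest = {"u|u->u": 10.0, "v|u->v": 10.0}
twisted = {"u|u->u": 11.0, "v|u->v": 9.0}  # still roughly sane, but biased

# Rewiring E -> E' gains the AI 2.0 in compensatory reward,
# but only if the change u -> v is actually coming.
gain = compensatory_reward(twisted) - compensatory_reward(honest)
```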
Double corrigibility
To solve this problem, we need to turn to corrigibility a second time. Let ρ be the AI’s current expectation/probability estimator, and let ρ′ be one that it could conceivably change into. Let C be a random variable designating the sum total of all compensatory rewards. Guaranteeing that C is finite would require bounds on all the utility functions and some discount rate; in practice, it often will be finite. That’s because E(C) = E(u0|u0→u0) - E(un|un-1→un, H), where u0 is the AI’s first utility, un its last, and H the intermediate history. So some milder restrictions on the ui should suffice to bound C.
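The telescoping behind that expression for E(C) can be checked on toy numbers. The assumption made explicit here is a consistent estimator: the switch-estimate E(u_{i+1}|u_i→u_{i+1}) at each change equals the next keep-estimate E(u_{i+1}|u_{i+1}→u_{i+1}), so the interior terms of the sum cancel.

```python
# Toy check (numbers assumed) of the telescoping sum of compensatory rewards.

keep = [10.0, 8.0, 6.0]   # E(u_i | u_i -> u_i) for i = 0, 1, 2
switch = keep[1:]          # E(u_{i+1} | u_i -> u_{i+1}), by the consistency assumption

# C is the sum of the compensatory rewards paid at each change:
C = sum(k - s for k, s in zip(keep[:-1], switch))
# The interior terms cancel, leaving first minus last:
#   C == keep[0] - keep[-1], i.e. E(u_0|u_0->u_0) - E(u_n|u_{n-1}->u_n).
```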
Then, whenever the AI takes an action that changes ρ to ρ′, it gets the special meta-compensatory reward
Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ′).
This prevents it from deriving any advantage from C via changes to ρ. Or, more precisely, it prevents it from deriving any expected advantage that it can foresee using its current ρ.
Note that it doesn’t prevent improvements to ρ in general; it just removes the incentive to change ρ in order to affect compensatory rewards.
Since any action a might have an indirect effect on ρ, the quantity maximised for a given u must be changed to:
Eρ(u|a) + Σρ′ Pρ(ρ→ρ′|a) (Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ′)),
where Pρ is the probability estimate corresponding to ρ; the probability term can be rewritten as Eρ(Iρ→ρ′|a), for Iρ→ρ′ the indicator function of the event ρ→ρ′. In fact, the whole line above can be rewritten as
Eρ(u|a) + Eρ(Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ′) | a).
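The corrected action value can be sketched as follows, in a hypothetical model where the agent is handed E_ρ(u|a), the transition probabilities P_ρ(ρ→ρ′|a), and the estimates E_ρ(C|ρ→ρ′) as black boxes. All function and key names are assumptions for illustration.

```python
# Sketch (hypothetical model) of the corrected action value
#   E_rho(u|a) + sum_{rho'} P_rho(rho->rho'|a) * (E_rho(C|rho->rho) - E_rho(C|rho->rho')).

def action_value(a, u_exp, transition_probs, C_exp, current="rho"):
    """a: action; u_exp(a) = E_rho(u|a);
    transition_probs(a) = {rho': P_rho(rho->rho'|a)};
    C_exp[rho'] = E_rho(C | rho->rho'), with C_exp[current] the no-change default."""
    meta_compensation = sum(p * (C_exp[current] - C_exp[rho_prime])
                            for rho_prime, p in transition_probs(a).items())
    return u_exp(a) + meta_compensation
```

On assumed numbers, an action that rewires ρ to inflate future compensatory rewards loses exactly that inflation back as a meta-compensatory term, so it can no longer beat an honest action on those grounds.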
For this to work, Eρ needs to be able to say sensible things about itself, and also about Eρ′, which is used to estimate C if ρ→ρ′.
If we compare this with various ways of factoring out variables, we can see that it’s a case where we have a clear default, ρ, and are estimating deviations from that.
FYI, this is not what the word “corrigibility” means in this context. (Or, at least, it’s not how we at MIRI have been using it, and it’s not how Stuart Russell has been using it, and it’s not a usage that I, as one of the people who originally brought that word into the AI alignment space, endorse.) We use the phrase “utility indifference” to refer to what you’re calling “corrigibility”, and we use the word “corrigibility” for the broad vague problem that “utility indifference” was but one attempt to solve.
By analogy, imagine people groping around in the dark attempting to develop probability theory. They might call the whole topic the topic of “managing uncertainty,” and they might call specific attempts things like “fuzzy logic” or “multi-valued logic” before eventually settling on something that seems to work pretty well (which happened to be an attempt called “probability theory”). We’re currently reserving the “corrigibility” word for the analog of “managing uncertainty”; that is, we use the “corrigibility” label to refer to the highly general problem of developing AI algorithms that cause a system to (in an intuitive sense) reason without incentives to deceive/manipulate, and to reason (vaguely) as if it’s still under construction and potentially dangerous :-)
Good to know. I should probably move to your usage, as it’s more prevalent.
Will still use words like “corrigible” to refer to certain types of agents, though, since that makes sense for both definitions.