Double indifference is better indifference
A putative new idea for AI control; index here.
Note: this post replaces and supersedes the previous one, which is now obsolete.
Hacking utility indifference
Utility indifference was a technique for corrigibility: an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI’s utility changes from u to v at time t, it maximises a meta-utility U such that
U≤t=u.
U>t=v+E(u|u→u)−E(v|u→v).
Here u→u designates the event that u remains unchanged, while u→v designates the change. This is a permitted utility change—one that we have set the agent up to accept—rather than just any old change in utility, which the agent will still attempt to resist.
As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change—it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
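To make the compensation concrete, here is a minimal numerical sketch (the numbers and the lookup-table "estimator" are assumptions for illustration, not from the post): the expected meta-utility is the same whether or not the permitted change happens, so the agent gains nothing by provoking or blocking it.

```python
# Minimal sketch of utility indifference with a toy expectation estimator
# (a lookup table standing in for E; all numbers are assumed for illustration).
estimates = {("u", "u->u"): 10.0,   # E(u | u -> u): value of u if u is kept
             ("v", "u->v"): 4.0}    # E(v | u -> v): value of v if u changes to v
E = lambda utility, event: estimates[(utility, event)]

def compensatory_reward(E):
    """The bonus E(u | u->u) - E(v | u->v) added to U after a permitted change."""
    return E("u", "u->u") - E("v", "u->v")

# Expected meta-utility U if the change does NOT happen:  E(u | u->u) = 10.
u_kept = E("u", "u->u")
# Expected meta-utility U if the change DOES happen:  E(v | u->v) + compensation = 10.
u_changed = E("v", "u->v") + compensatory_reward(E)

assert u_kept == u_changed   # indifference: no incentive to provoke or resist u -> v
```

Note that the compensation is computed with the agent’s own estimator E, which is exactly the loophole discussed next.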
What it is not indifferent to, though, is changes to E. If the AI knows that change u→v is coming, then it has incentives to change E to E′ with a high E′(u|u→u) and a low E′(v|u→v). Its ability to rewire is limited by the fact that E′ will serve as its future expectation estimator, so it won’t mangle it completely, but it certainly can twist it a bit.
Notice that the AI is now no longer indifferent to the permitted change of utility. It can benefit from rewiring E this way only if a change in utility is coming up.
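Continuing the toy numbers above (still assumptions for illustration), twisting the estimator just before the change strictly increases the agent’s post-change payoff:

```python
# Honest estimator versus a twisted E' that the agent might rewire itself into
# just before the permitted change u -> v (all numbers assumed for illustration).
honest  = {("u", "u->u"): 10.0, ("v", "u->v"): 4.0}
twisted = {("u", "u->u"): 12.0, ("v", "u->v"): 2.0}   # inflated / deflated estimates

honest_bonus  = honest[("u", "u->u")]  - honest[("v", "u->v")]    # 6
twisted_bonus = twisted[("u", "u->u")] - twisted[("v", "u->v")]   # 10

assert twisted_bonus > honest_bonus   # rewiring E pays, but only when u -> v is coming
```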
Probability estimators
Let ρi be probability estimators. These define both expectation estimators Eρi and probability estimators Pρi. These two are equivalent in principle, since, for any random variable X with Ix the indicator function for X=x, we have Pρi(X=x)=Eρi(Ix) and EρiX=∑x x·Pρi(X=x).
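A small sketch of this equivalence, with an explicit finite distribution standing in for ρi (the distribution is an assumption for illustration):

```python
# Probability and expectation estimators carry the same information:
# P(X = x) is the expectation of the indicator I_x, and E[X] is sum_x x * P(X = x).
dist = {0: 0.2, 1: 0.5, 2: 0.3}          # toy P_rho(X = x), assumed for illustration

def E_rho(f):
    """Expectation estimator E_rho[f(X)] induced by the distribution above."""
    return sum(p * f(x) for x, p in dist.items())

def P_rho(x0):
    """Probability estimator recovered as E_rho of the indicator I_{X = x0}."""
    return E_rho(lambda x: 1.0 if x == x0 else 0.0)

assert abs(P_rho(1) - 0.5) < 1e-12                                        # P(X=x) = E[I_x]
assert abs(E_rho(lambda x: x) - sum(x * P_rho(x) for x in dist)) < 1e-12  # E[X] = sum_x x P(X=x)
```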
Double indifference
One idea to combat this is to apply indifference to the process of changing probability estimators. Thus an AI with estimator ρi changing to ρj would get compensatory reward
Eρi(U|ρi→ρi)−Eρi(U|ρi→ρj).
This won’t work, however, since we don’t want to discount all the impacts of the change ρi→ρj (maybe ρj is a much more accurate estimator), just the part of the change that affects the compensatory rewards. So the second layer of indifference will be a bit different from the previous one.
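A quick illustration of the problem (assumed numbers, not from the post): if ρj really is better for U, the naive compensation exactly cancels the incentive to adopt it.

```python
# Naive second-layer compensation, with assumed numbers.  Suppose rho_i thinks
# switching to the more accurate rho_j would raise expected U from 8 to 9.
E_U_keep_rho_i  = 8.0    # E_{rho_i}(U | rho_i -> rho_i)
E_U_adopt_rho_j = 9.0    # E_{rho_i}(U | rho_i -> rho_j)

naive_compensation = E_U_keep_rho_i - E_U_adopt_rho_j        # -1
# With that reward added, switching is worth 9 + (-1) = 8: no better than staying,
# so the genuine benefit of the more accurate estimator gets cancelled as well.
assert E_U_adopt_rho_j + naive_compensation == E_U_keep_rho_i
```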
First, a subtlety. If we designate by ρ=ρi the fact that the agent uses estimator ρi, it is very possible that Eρi(u|ρ=ρj)>Eρi(u|ρ=ρi). This simply means that, in ρi’s ‘opinion’, ρj is a better estimator for maximising u than it is itself.
Now assume that, at time t, the agent is using estimator ρt and has current utility ut. Let C>t(u,v,ρ′) be all future compensatory rewards of the type Eρ′(u|u→u,ρ=ρ′)−Eρ′(v|u→v,ρ=ρ′), for a permitted change u→v. Let C>t(u,v,ρ′′,ρ′) be the same, except that the expectation estimator is ρ′′: thus it is Eρ′′(u|u→u,ρ=ρ′)−Eρ′′(v|u→v,ρ=ρ′). Note that C>t(u,v,ρ′,ρ′)=C>t(u,v,ρ′).
Then define C>t (and C>t(ρ′′)) by summing C>t(u,v,ρ′) (and C>t(u,v,ρ′′,ρ′)) over all u, v, and ρ′.
Let D>t be all future compensatory rewards that are not of the type C>t. Then if the agent changes ρt into ρi, it gets the compensatory reward
Dt=−Eρt(D>t|ρt→ρi)−Eρt(C>t|ρt→ρi)+Eρt(C>t(ρt)|ρt→ρi).
Note that Dt is defined in terms of D>t, so, though there may be convergence issues, the definition is not circular.
Then the first two terms of Dt cancel out (according to the agent’s best current expectation) all future compensatory rewards. The third term replaces all the C-type compensatory rewards with the same, except estimated by the agent’s current ρt rather than by whichever estimator the agent will have at the time. Thus the agent will be indifferent to all actual compensatory rewards (as it gets none in expectation), and instead gains estimated rewards that make it indifferent to any future change ui→uj.
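Rearranging the definition: in ρt’s own expectation, Dt plus the future D and C rewards sum to Eρt(C>t(ρt)|ρt→ρi). A toy sketch of that bookkeeping (all numbers assumed for illustration):

```python
# Toy bookkeeping for the correction D_t paid when the agent swaps rho_t for rho_i.
# The three inputs are rho_t's conditional expectations, given the swap, of:
#   future D-type rewards, future C-type rewards (as they will actually be paid),
#   and the same C-type rewards recomputed with rho_t itself.
def D_t(exp_future_D, exp_future_C, exp_future_C_under_rho_t):
    return -exp_future_D - exp_future_C + exp_future_C_under_rho_t

# Assumed numbers: rho_t expects the later estimators to pay out C rewards of 7
# (versus its own estimate of 5), plus a further 1 in D rewards.
d = D_t(exp_future_D=1.0, exp_future_C=7.0, exp_future_C_under_rho_t=5.0)   # -3
# In rho_t's expectation, total compensatory income nets out to its own C estimate:
assert d + 1.0 + 7.0 == 5.0
```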
See this post for the same formula in the AIXI formalism.
There are certainly cases where the above terms converge (for instance, when all the ui are bounded and there are only finitely many possible changes to both the utility and ρ). A full discussion of convergence will wait for another post.
Sequentially unbiased estimators
Call the estimators {ρi} sequentially unbiased if for all i<j, and all X, EρiEρjX=EρiX. Then the main result is:
If the different ρi used by the agent are sequentially unbiased when labelled sequentially, then classical corrigibility is still sufficient.
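For intuition, here is a toy check of the sequential unbiasedness condition, with ρi a prior over a hidden coin and ρj the posterior after seeing it (this two-estimator setup is an assumption for illustration, not from the post):

```python
# Sequential unbiasedness is a tower property: E_{rho_i}[ E_{rho_j}[X] ] = E_{rho_i}[X].
# Toy setup: rho_i only has a prior over a hidden coin; rho_j has seen the coin.
p_heads = 0.3                       # rho_i's prior over the coin
X = {"H": 2.0, "T": -1.0}           # a random variable determined by the coin

def E_rho_j(coin):
    """rho_j's expectation of X: it has seen the coin, so it just reads X off."""
    return X[coin]

E_i_of_X        = p_heads * X["H"] + (1 - p_heads) * X["T"]
E_i_of_E_j_of_X = p_heads * E_rho_j("H") + (1 - p_heads) * E_rho_j("T")

assert abs(E_i_of_X - E_i_of_E_j_of_X) < 1e-12   # rho_i, rho_j are sequentially unbiased
```

In this setup ρj simply refines ρi’s information, so ρi expects ρj’s estimates to agree with its own on average, which is what makes the D corrections vanish in the proof below.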
To prove this, imagine that the agent will see a single change of utility u→v, and that it has the opportunity of changing ρi (to ρj, say) exactly once before the utility change. Then there will be no D term for ρj, and, if it does change ρi→ρj, the D term is:
Dt=−Eρi(C>t|ρi→ρj)+Eρi(C>t(ρi)|ρi→ρj).
However, Eρi(C>t|ρi→ρj)=Eρi(Eρj(u|u→u,ρ=ρj)−Eρj(v|u→v,ρ=ρj)). Because of sequential unbiasedness, this simplifies to Eρi(u|u→u,ρ=ρj)−Eρi(v|u→v,ρ=ρj), which is just Eρi(C>t(ρi)|ρi→ρj). So Dt=0.
We can then recurse to the change in ρ just before ρi, and get the same result (since the future D is still zero). And so on, with D always being zero. Then, since the formulas defining D are linear, we can extend this to general environments and general utility function changes, and conclude that for sequentially unbiased ρi, the D terms are always 0 under double indifference (modulo some convergence issues not addressed here). Therefore double indifference works even if the D terms are never actually paid out: classical indifference is already sufficient in this case.
Note the similarity of sequential unbiasedness to the conditions for successful value learners in the Cake or Death problem.