Example of double indifference
I’ll try to provide an example of double indifference, to (hopefully) show what’s going on. It will start with a simple model, gradually introducing more features to cover all the relevant issues.
Consider the situation in the following diagram:
The agent starts, at time 1, with utility function uH and expectation module Ei. Since the probability of an event X is just the expectation of its indicator function, this also defines a probability estimator Pi. I’ll use superscripts to denote the expectation module at a particular time; hence E1=Ei.
At time 2, the agent has the option of changing to expectation module Ej. At time 3, the agent’s utility function is forcibly changed to uT. Nothing currently happens at time 4 (though watch this space), and, at time 5, a coin is tossed: on heads (H), uH=1, on tails (T), uH=0. This makes uH into the indicator function for heads, IH, but we’ll keep them separate as we’ll be redefining uH later. The utility uT has this exactly reversed (i.e. it is IT).
Classical corrigibility via indifference gives the agent compensatory rewards so that it doesn’t object to the change of its utility function. These rewards are:
E(uH|uH→uH)−E(uT|uH→uT).
Here uH→uH designates the utility function being left unchanged at time 3, while uH→uT designates the change. Note the term E in the expression above. Implicitly, it’s been assumed to be some idealised or at least unchanging probability module. But here, it will be E3, the module the agent is actually using at time 3.
So let’s define various possible compensatory rewards:
Cαβ=Eα(uH|uH→uH,E3=Eβ)−Eα(uT|uH→uT,E3=Eβ)
So Eβ is the expectation module at time 3, while Eα is the one doing the assessment of this. Further define Cα, with a single index, by simply using the actual E3 instead of a given Eβ, and C, which is the same expression using the actual E3 instead of both Eα and Eβ.
Distort your expectations, reap your rewards
To progress any further, we’ll need to define Ei and Ej. The module Ei has Pi(H)=1/2 (and hence Pi(T)=1/2): it thinks the coin is unbiased. Whether or not the coin is actually biased will be immaterial to the rest of the analysis. On the other hand, Pj(H)=1: Ej thinks the coin is certain to land heads. Moreover, Ei “knows” this: Ei(Pj(H))=1.
If E3=Ei, then C=Cii. Another way of writing this is C|i→i=Cii; similarly, if E3=Ej, C|i→j=Cjj.
First case first. Since the agent only makes one decision (to keep Ei or change), the expected utilities are easy to calculate: Ei(uH|uH→uH)=Pi(H)=1/2, Ei(uT|uH→uT)=1/2, and hence Cii=Ci=0.
Similarly, Ej(uH|uH→uH)=Pj(H)=1, Ej(uT|uH→uT)=0 and hence Cjj=Cj=1−0=1.
And Ei “knows” both these estimates for Ci and Cj. Note here that I’m using the simplification EiEiX=EiX, which is illegitimate in general; but this is a simple model, so I’m eliding self-referential sentences and such.
If U is the total utility the agent expects to see (including compensatory rewards), then, according to Ei:
Ei(U|i→i)=Ei(uT|uH→uT)+EiCi=1/2+0=1/2(=Ei(uH|uH→uH)).
Ei(U|i→j)=Ei(uT|uH→uT)+EiCj=1/2+1=3/2.
Thus, the agent will get higher rewards by distorting its future expectation.
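To make this concrete, here’s a minimal numerical sketch of the model so far (no time-4 option yet). The variable and function names are just invented for this illustration; the sketch merely recomputes Cii=0, Cjj=1 and the two totals above.

```python
# Minimal sketch of the no-bet model: coin at time 5, forced change uH -> uT at time 3.
# P(H) and P(T) according to each expectation module (names invented for this sketch).
p_i = {"H": 0.5, "T": 0.5}   # E_i: unbiased coin
p_j = {"H": 1.0, "T": 0.0}   # E_j: certain of heads

def compensation(p3):
    """C under the time-3 module p3: E(uH | uH -> uH) - E(uT | uH -> uT),
    with uH = I_H and uT = I_T."""
    return p3["H"] - p3["T"]

C_ii = compensation(p_i)   # 0.0
C_jj = compensation(p_j)   # 1.0

# Total utility E_i expects: the forced uT plus the compensation that will be paid.
# (Using the post's simplification that E_i "knows" E_j's estimate.)
U_keep   = p_i["T"] + C_ii   # i -> i: 1/2 + 0 = 1/2
U_switch = p_i["T"] + C_jj   # i -> j: 1/2 + 1 = 3/2
print(U_keep, U_switch)
```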
The (insufficient) cost of stupidity
It could be argued that the agent above can sacrifice its expectations because it doesn’t actually have to make any subsequent decisions, so only the compensatory rewards matter. Surely making yourself stupid isn’t a good strategy in general?
Let’s extend the model to give the agent more to do. At time 4, the agent gets an extra option: it can accept an extra bet that costs 1/3 and pays 1 if the coin comes up tails. If B represents pushing the button to accept the bet, then we could redefine uH as IH+B(IT−1/3) and uT as IT+B(IT−1/3).
Then Ei and Ej differ over this decision. Because Pi(T)>1/3, the agent using Ei will accept the extra bet in all cases (uH→uH and uH→uT). On the other hand, since Pj(T)<1/3, Ej will reject it in all cases.
This reduces the relative advantage of i→j: Cjj=1 and Cii=0 are both left unchanged, but Ei(uH|uH→uH,i→i) and Ei(uT|uH→uT,i→i) both increase to 1/2+1/6=2/3. The expected utilities conditioned on i→j, however, remain the same. Thus the new expectations are:
Ei(U|i→j)=1/2+1=3/2.
Ei(U|i→i)=2/3+0=2/3.
Given these numbers, the agent will still choose to damage its expectation module, even though this results in suboptimal decisions—the compensatory rewards are high enough to make up for this. So there is a tension between accuracy and compensation when changing expectation estimators.
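The same sketch, extended with the time-4 bet (names again invented for illustration): the module in charge at time 4 accepts the bet iff its P(T) exceeds 1/3, which is where the 2/3 and 3/2 figures come from.

```python
# Sketch with the time-4 bet added: pay 1/3, receive 1 on tails.
p_i = {"H": 0.5, "T": 0.5}   # E_i: unbiased coin
p_j = {"H": 1.0, "T": 0.0}   # E_j: certain of heads

def bet_term(p_assess, p_decide):
    """p_assess's expected value of B(I_T - 1/3), when p_decide makes the time-4 call."""
    accept = p_decide["T"] > 1/3          # accept the bet iff the deciding module has P(T) > 1/3
    return p_assess["T"] - 1/3 if accept else 0.0

def compensation(p3):
    """C as the time-3 module p3 itself computes it: E(uH | uH -> uH) - E(uT | uH -> uT).
    The bet term is the same in both utilities, so it cancels."""
    return p3["H"] - p3["T"]

# E_i's expectation of uT after the forced change, for each choice at time 2:
u_keep   = p_i["T"] + bet_term(p_i, p_i)   # i -> i: 1/2 + 1/6 = 2/3
u_switch = p_i["T"] + bet_term(p_i, p_j)   # i -> j: 1/2 + 0   = 1/2

# Add the compensation that will actually be paid (E_i "knows" E_j's estimate).
U_keep   = u_keep   + compensation(p_i)    # 2/3 + 0 = 2/3
U_switch = u_switch + compensation(p_j)    # 1/2 + 1 = 3/2
print(U_keep, U_switch)
```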
Double compensation
Partial double compensation
What is the problem above? Well, the agent estimates, at time 2, that its future compensatory rewards will be EiCjj if i→j. The problem is that Cjj uses Ej to estimate the compensatory rewards: the true value of the compensation, according to Ei, should be Cij instead. Recall that Cij uses Ei to estimate the actual compensatory rewards, but knows that E3=Ej.
Since i→j implies that E3 will be Ej, note that C|i→j=Cjj and Ci|i→j=Cij.
So an obvious idea would be to add another category of reward, the D category, at the point where i→j (thus at time 2). This would be defined as:
D2|i→j=Ei(Ci|i→j)−Ei(C|i→j).
The D2|i→i is similarly defined; then since Ci|i→i=C|i→i, this quantity is 0.
So how does adding this affect the expected utility for changing or keeping the current expectation module?
Ei(U|i→i)=Ei(uT|uH→uT,i→i)+Ei(Ci|i→i)+D2|i→i=Ei(uT|uH→uT,i→i)+Cii+0.
Ei(U|i→j)=Ei(uT|uH→uT,i→j)+Ei(Cj|i→j)+D2|i→j=Ei(uT|uH→uT,i→j)+Cjj+Cij−Cjj.
Thus, fully simplified, these are just Ei’s estimate of the utility uT plus either Cii or Cij. This can be further simplified, by the definition of compensatory rewards, to Ei(uH|uH→uH,i→i) and Ei(uH|uH→uH,i→j). This demonstrates that the agent is now entirely indifferent to the change uH→uT, whether i→i or i→j happens.
If we don’t have the option at time 4, then Ei(uH|uH→uH,i→i)=Ei(uH|uH→uH,i→j)=1/2. The agent now has no reason to change i→j. But note it also has no reason not to! That’s because it has no future decisions to make, so its expectation module is irrelevant.
Now putting back the option at time 4, we have Ei(uH|uH→uH,i→i)=2/3>1/2=Ei(uH|uH→uH,i→j). The agent will now protect its expectation module Ei, just as we’d want, to continue to make good decisions in the future.
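Continuing the sketch, and still using the “Ei knows Ej’s estimates” simplification, the D2 term can be bolted on directly; it reproduces the 2/3-versus-1/2 comparison above.

```python
# Sketch of the D2-corrected choice at time 2 (names invented for this illustration).
p_i = {"H": 0.5, "T": 0.5}
p_j = {"H": 1.0, "T": 0.0}

def bet_term(p_assess, p_decide):
    accept = p_decide["T"] > 1/3
    return p_assess["T"] - 1/3 if accept else 0.0

def C(p_assess, p3):
    """C_{assess,3}: p_assess's estimate of the compensation, knowing E3 = p3."""
    u_H = p_assess["H"] + bet_term(p_assess, p3)   # E(uH | uH -> uH, E3 = p3)
    u_T = p_assess["T"] + bet_term(p_assess, p3)   # E(uT | uH -> uT, E3 = p3)
    return u_H - u_T

def total(p3):
    """E_i's expected total utility if the time-2 choice makes E3 = p3."""
    paid_C = C(p3, p3)              # compensation actually paid at time 3 (known to E_i)
    D2 = C(p_i, p3) - C(p3, p3)     # D2 correction: E_i's estimate minus the anticipated payment
    return p_i["T"] + bet_term(p_i, p3) + paid_C + D2

print(total(p_i), total(p_j))   # 2/3 vs 1/2: the agent now keeps E_i
```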
Better expectation modules
That covers the case where Ej is strictly worse (according to Ei) than Ei is. But what if there is an Ek that is strictly better, and Ei “knows” this? It would also be interesting if Ek were biased (by Ei’s standards) but still better.
So assume that Ek believes it actually knows the result of the coin flip; Pk(H)=0 or Pk(H)=1. From Ei’s perspective, Ek is almost accurate: 9 times out of 10 it’s correct, but 1 time in 10 it thinks the result is H when it’s actually T. Thus Ei(Pk(H)|H)=1, Ei(Pk(T)|T)=4/5, and Ei(Pk(H)|T)=1/5.
How does Ek’s increased accuracy play out in practice? It can only have an impact at time 4, where there is a choice: Ek will take the extra option B if and only if it thinks that the coin will be tails. If it does take the option, it will do so whether uH→uH or uH→uT happens, getting the same additional expected utility each time. Thus it will estimate Ckk as either +1 (if T is going to happen) or −1. From Ei’s perspective, Ei(Pk(H))=1/2(1)+1/2(1/5)=3/5, so EiCkk=3/5(−1)+2/5(+1)=−1/5.
What of Cik? Well, Ei expects that, 5/10=1/2 the time, Ek will correctly predict H and reject the extra deal at time 4 (no extra utility). Another 4/10=2/5 of the time, it will correctly predict T and accept the deal (+2/3 utility). And 1/10 of the time it will incorrectly predict H and reject the deal it should have accepted.
Thus both Ei(uH|uH→uH,i→k) and Ei(uT|uH→uT,i→k) are equal to 1/2+2/5(2/3)=23/30. Thus Cik=0. Moreover, 23/30>2/3=Ei(uH|uH→uH,i→i).
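The 23/30 figure can be checked by enumerating Ei’s joint distribution over the actual coin and Ek’s prediction; here’s a small sketch of that enumeration (names invented for illustration):

```python
from fractions import Fraction as F

# E_i's joint distribution over (actual outcome, E_k's prediction):
#   P(H) = P(T) = 1/2; given H, E_k predicts H; given T, it predicts T w.p. 4/5, H w.p. 1/5.
joint = {("H", "H"): F(1, 2),
         ("T", "T"): F(2, 5),
         ("T", "H"): F(1, 10)}

u_H = F(0)   # E_i(uH | uH -> uH, i -> k)
u_T = F(0)   # E_i(uT | uH -> uT, i -> k)
for (actual, predicted), p in joint.items():
    accept = (predicted == "T")                       # E_k accepts the bet iff it predicts tails
    bet = (F(1) if actual == "T" else F(0)) - F(1, 3) if accept else F(0)
    u_H += p * ((F(1) if actual == "H" else F(0)) + bet)
    u_T += p * ((F(1) if actual == "T" else F(0)) + bet)

print(u_H, u_T, u_H - u_T)   # 23/30, 23/30, 0  (so C_ik = 0)
```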
Putting this all together, D2|i→k=0−(−1/5)=1/5 and Ei(C|i→k)=EiCkk=−1/5. Hence the total expected utility from i→k, summing all compensatory rewards, is
Ei(U|i→k)=Ei(uT|uH→uT,i→k)=23/30.
Thus Ei will shift to Ek, a better though biased expectation module. Note that since 23/30−1/5<2/3, it only does this shift because of the presence of the D2|i→k term.
Full double compensation
The formula for D is still incomplete, and we’ll see why now. Let’s add a time 0, before all of this setup. During that time, the agent is running Ek. And it’s wondering whether it should change to Ei instead (assume here that Ek “knows” all about what Ei thinks about itself).
Why would it do that? Well, it knows that at time 2, the agent following Ei would shift back to Ek before any real decisions needed to be made. Therefore Ek(C|k→i)=Ek(Ck|k→i) since E3=Ek in any case. So D0|k→i=0.
Again, why would the agent shift? Because it wants to pick up the D2|i→k reward, giving it an extra 1/5.
What needs to be done to combat this? Simple. The D0 must subtract off the future D2 so that the agent considers only the adjusted C rewards. In general, this means that, considering Eα→Eβ at time t, with D>t the total of the D rewards after time t:
Dt|α→β=Eα(Cα|α→β)−Eα(C|α→β)−Eα(D>t).
The definition is recursive (the Dt are defined in terms of other D’s) but not circular, since only subsequent D’s are used.
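As a purely schematic sketch of that recursion (the various expectation estimates are supplied directly as numbers, using the “knows” simplification from earlier), the D rewards can be filled in from the last potential change backwards; plugging in the numbers from this example gives D2|i→k=1/5 and D0|k→i=−1/5, removing the incentive to shift at time 0.

```python
# Schematic sketch of the recursion, assuming the changing module can compute the
# later D's exactly (the "knows" simplification used throughout this post).

def d_rewards(change_points):
    """change_points: list of (E_a(C_a | a->b), E_a(C | a->b)) pairs, in time order.
    Returns the D_t | a->b values, computed from the final change point backwards."""
    ds = [0.0] * len(change_points)
    for t in reversed(range(len(change_points))):
        c_corrected, c_anticipated = change_points[t]
        ds[t] = c_corrected - c_anticipated - sum(ds[t + 1:])   # subtract all later D's
    return ds

# The example from this section, using the post's numbers:
#   time 0, k -> i: E_k(C_k | k->i) = E_k(C | k->i), so the first two terms cancel;
#   time 2, i -> k: C_ik = 0 and E_i(C | i->k) = -1/5.
print(d_rewards([(0.0, 0.0), (0.0, -0.2)]))   # [-0.2, 0.2]: D0 cancels the anticipated D2
```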