All the indifference designs
A putative new idea for AI control; index here.
The aim of this post is simply to collect all the different “indifference” proposals that I know of. Let me know if I’ve missed any!
The basic idea is that the agent is following policy π0 with reward R0, and there is a button that, if pressed (event B), will switch the agent to following policy π1 with reward R1.
Policy versus reward
My initial proposal considered only changes of utility functions—similar to reward functions—and assumed the policy would always be to maximise that utility.
Conversely, the Safely Interruptible Agents paper just looked at policy changes, while sticking to the same reward function.
In general, policy and reward changes can be treated similarly—what matters is the change between Vπ0(R0) (the expected value of R0 under π0, as estimated by the agent) and Vπ1(R1).
Indeed, you could just treat it as a policy change, with the added caveat that πi also ‘chooses’ to claim its reward from Ri.
Method: Bias penalty or indifference
If we have an initial estimate p for the probability of B (hence of the agent following π1/R1), then if the agent takes an action a, this probability changes to pa.
One way of fighting this is to penalise the agent for large changes p→pa. In the terms of this post, the agent is taking biased actions.
And to combat large biases, the agent could get an extra bias penalty, as explained in the Towards Interactive Inverse Reinforcement Learning poster.
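As a rough sketch of how such a penalty could work (the penalty coefficient and helper functions below are illustrative, not taken from the poster):

```python
# Minimal sketch of a bias penalty (illustrative only).
# p_default:       the agent's default estimate p of the probability of B.
# prob_B_after(a): the agent's estimate p_a of that probability after taking action a.

def penalised_score(action, value_estimate, p_default, prob_B_after, kappa=10.0):
    """Score an action by its expected value, minus a penalty for shifting P(B)."""
    p_a = prob_B_after(action)
    bias = abs(p_a - p_default)   # how much the action moves the probability of B
    return value_estimate(action) - kappa * bias

# The agent then picks the action maximising this penalised score, e.g.
# best = max(actions, key=lambda a: penalised_score(a, V, p, prob_B_after))
```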
But the rest of this post will focus on bias-indifference, rather than bias penalty.
Counterfactual: evidential, causal, translation
When comparing Vπ0(R0) and Vπ1(R1), it’s important to note that this is a counterfactual comparison: it compares what would happen under π0 versus π1, but only one of those policies will actually be followed, depending on whether the button press B happens.
Evidential counterfactual
The first idea is to have some sort of fixed probability for B versus ¬B. And then the agent acts as if the probability cannot be changed by anything the agent does.
Call this the “evidential” counterfactual. The problem with it is that it causes the agent to want to manipulate conditional probabilities (similarly to how evidential decision theory can desire to manipulate the news). Basically, if the agent can decrease the probability of B while making the reward for B much larger, it will do so (since, by construction, it doesn’t track the reduction in probability).
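As a toy illustration of this failure mode (the numbers and action labels below are invented purely for the example):

```python
# Toy illustration of the evidential counterfactual's manipulation problem.
p_default = 0.5  # fixed default probability of B used for evaluation

actions = {
    # action: (true probability of B after the action, reward if B, reward if not B)
    "honest":     (0.50, 1.0, 1.0),
    "manipulate": (0.01, 100.0, 1.0),  # makes B very unlikely but hugely rewarded
}

def evidential_value(true_p_B, reward_B, reward_not_B):
    # The agent ignores how its action changed P(B) and uses the fixed default.
    return p_default * reward_B + (1 - p_default) * reward_not_B

for name, (true_p_B, r_B, r_not_B) in actions.items():
    print(name, evidential_value(true_p_B, r_B, r_not_B))
# "manipulate" scores 50.5 against 1.0, even though B almost never actually happens.
```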
The factoring out variables approach seems to use this evidential counterfactual, and thus suffers from this problem.
Causal counterfactual
My original idea was that there was some low-probability event e such that if B happened AND e happened, then the utility/policy would not change.
The problem with this is that, as Eliezer and Jessica clarified, this contrasts Vπ0(R0|B,e) with Vπ1(R1|B,¬e), whereas we really want to contrast Vπ0(R0|¬B) with Vπ1(R1|B).
Now, if e is of low probability, Vπ1(R1|B)≈Vπ1(R1|B,¬e), except for bets that turn on the value of e. However, the difference between Vπ0(R0|¬B) and Vπ0(R0|B,e) is more problematic. In the first case, humans have chosen not to press the button; in the second, humans have tried to press the button but failed to change the agent’s utility. Human behaviour can obviously be very different in these two cases, leading to very different value estimates.
Weight translation
This method relies on some sort of initial ‘default’ probability of B, call this p. Then the agent considers an action a, and this probability changes to pa.
Thus after a, the agent expects to be maximising reward R0 with probability 1−pa, and maximising R1 with probability pa.
Let w0 and w1 be the ultimate weights of the rewards R0 and R1: the weights that apply after B or ¬B has happened (what the weights themselves are will be defined in later sections). The default expected weights are 1−p and p, while the expected weights given a are 1−pa and pa.
Then the translation approach wants to map the weights back to the default. Let w′0 and w′1 be the adjusted weights; then:
$$E(w_0) = 1 - p_a, \quad E(w_1) = p_a, \quad E(w'_0) = 1 - p, \quad E(w'_1) = p.$$
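For concreteness, here is a small worked instance with made-up numbers: suppose the default probability is p=0.1 and the action a pushes it up to pa=0.3. Then:

$$E(w_0) = 1 - p_a = 0.7, \quad E(w_1) = p_a = 0.3, \quad E(w'_0) = 1 - p = 0.9, \quad E(w'_1) = p = 0.1,$$

so the translation has to shift the expected weight of R0 up by 0.2 and the expected weight of R1 down by 0.2, undoing the effect of a on the expected weights.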
The rest of this post will focus mainly on the weight translation “counterfactual”.
Update defaults: no or observations only
The bias penalty, the evidential counterfactual, and the translation approach all depend on some default assessment of the probability of B. One could either set some fixed probability at the beginning and never update it, or allow it to be updated only by observations, and not by the agent’s own actions.
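A minimal sketch of the second option, in my own framing (the Bayesian update via a likelihood ratio is an assumption, not something specified here):

```python
# Sketch: a default estimate of P(B) that updates on observations only.
# The agent's own actions never feed into this default.

class DefaultProbability:
    def __init__(self, p_initial):
        self.p = p_initial  # current default probability of B

    def update_on_observation(self, observation, likelihood_ratio):
        """Bayesian update from an observation o: odds *= P(o | B) / P(o | not B)."""
        odds = self.p / (1.0 - self.p) * likelihood_ratio(observation)
        self.p = odds / (1.0 + odds)

    def default_for_evaluating_actions(self):
        # Candidate actions are always scored against the current default;
        # taking an action does not change self.p, only observations do.
        return self.p
```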
Translation type: simple, mean, or specific
Given a prior history h, let ph be the default probability of B and pha be the probability of B given that the agent takes action a. Let the current weight of Ri be w′i(h).
The translation approach aims to correct the expected weights of R0 and R1 after the action a back to their default values at h. By definition, if o is the next observation and μ(o|ha) is the probability of o given h and a:
$$\sum_o \mu(o \mid ha)\, w'_i(hao) = w'_i(h).$$
A simple translation is one where there is a vector v such that all w′i(hao) are equal to wi(hao)+v.
A mean translation is one where this mean equality is the only essential requirement; the w′i(hao) may be constrained in various other ways, but only the mean equality has to hold.
A specific translation is one that fixes specific values for the w′i(hao), with the mean equality following as a consequence of those values. The counterfactual approach can be seen as an example of a specific translation (see the examples below).
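As an illustration, here is a minimal sketch of a simple translation in this notation (the data structures and function name are mine): a single shift v, the same for every next observation o, chosen so that the expected weights return to the defaults.

```python
# Sketch of a "simple" translation: add one vector v to the weights after every
# next observation o, so that the expected weights return to the default w'_i(h).

def simple_translation(mu, w, w_default):
    """
    mu:        dict o -> mu(o | ha), the probability of observation o after h, a
    w:         dict o -> (w_0(hao), w_1(hao)), the unadjusted weights
    w_default: (w'_0(h), w'_1(h)), the default weights at history h
    Returns a dict o -> adjusted weights w'_i(hao) = w_i(hao) + v_i.
    """
    # Expected unadjusted weights under mu.
    exp_w0 = sum(mu[o] * w[o][0] for o in mu)
    exp_w1 = sum(mu[o] * w[o][1] for o in mu)
    # The single shift that restores the default expectations.
    v = (w_default[0] - exp_w0, w_default[1] - exp_w1)
    return {o: (w[o][0] + v[0], w[o][1] + v[1]) for o in mu}
```

A mean translation would be free to distribute this adjustment unevenly across the observations o, as long as the expectation condition above still holds.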
Weights: probabilities or extra rewards
Finally, we have to sort out what we mean by these ‘weights’.
The simplest option is that these weights are probabilities. So, just before the choice between B and ¬B, the agent could have weights w′Bi for the case B and w′¬Bi for the case ¬B. Then if B happens, the agent has reward Ri with probability w′Bi; and if ¬B happens, it has Ri with probability w′¬Bi.
If the AI didn’t take any actions at all, then w′B1=w′¬B0=1 and w′B0=w′¬B1=0, same as the original values.
The problem with that approach is that we must ensure the weights are constrained between 0 and 1.
Alternatively, changes in weights can instead be seen as adding extra rewards to the final rewards, rather than changing their probabilities. Thus if w′B1=1+a and B happens, the agent’s reward function is (1+a)R1−aR0, and if w′¬B0=1+b and ¬B happens, the agent’s reward function is (1+b)R0−bR1.
This has the advantage that the weights may be negative or greater than one, but the disadvantage that it may result in unusual mixed reward functions.
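A short sketch of this “extra rewards” reading (my own code, assuming the two weights on each branch sum to one, as in the example above):

```python
# Sketch: turning an adjusted weight into a mixed reward function, under the
# assumption that the two weights on each branch sum to one.

def mixed_reward(w_prime, R0, R1, button_pressed):
    """
    w_prime: adjusted weight of the reward that nominally applies on this branch
             (R1 if the button was pressed, R0 if it was not).
    R0, R1:  reward functions mapping a state to a real number.
    Returns the effective reward function for the branch that occurred.
    """
    if button_pressed:   # w'_{B,1} = w_prime, so w'_{B,0} = 1 - w_prime
        return lambda s: w_prime * R1(s) + (1 - w_prime) * R0(s)
    else:                # w'_{notB,0} = w_prime, so w'_{notB,1} = 1 - w_prime
        return lambda s: w_prime * R0(s) + (1 - w_prime) * R1(s)

# With w_prime = 1 + a and the button pressed, this gives (1+a)R1 - aR0,
# the kind of mixed reward function described above.
```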
Examples
Given these terms, the indifference approach I described as the best is: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple for small translations and mean for large ones, and Weights: probabilities.
One could imagine slightly tweaking that approach, by using extra rewards for weights, and dropping the complicated conditions needed to keep the weights bounded between 0 and 1, allowing simple translations always. This would result in: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple, and Weights: extra rewards.
Finally, the counterfactual approach can be seen as: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: specific, and Weights: probabilities.