In Cooperative Inverse Reinforcement Learning (CIRL), a human H and a robot R cooperate in order to best fulfil the human’s preferences. This is modeled as a Markov game M=⟨S,{AH,AR},T(⋅|⋅,⋅,⋅),{Θ,R(⋅,⋅,⋅;⋅)},P0(⋅,⋅),γ⟩.
This setup is not as complicated as it seems. There is a set S of states, and in any state, the human and robot take simultaneous actions, chosen from AH and AR respectively. The transition function T takes this state and the two actions, and gives the probability of the next state. The γ is the discount factor of the reward.
What is this reward? Well, the idea is that the reward is parameterised by a θ∈Θ, which only the human sees. Then R takes this parameter, the state, and the actions of both parties, and computes a reward; this is R(s,aH,aR;θ) for a state s and actions aH and aR by the human and robot respectively. Note that the robot will never observe this reward; it will simply compute it. The P0 is a joint probability distribution over the initial state s0 and the θ that will be observed by the human.
Behaviour in a CIRL game is defined by a pair of policies (πH,πR) that determine the action selection for H and R respectively. Each agent gets to observe the past actions of the other agent, so in general these policies could be arbitrary functions of their observation histories: πH:[AH×AR×S]∗×Θ→AH and πR:[AH×AR×S]∗→AR.
The optimal joint policy is the policy that maximises value, which is the expected sum of discounted rewards. This optimum is the best H and R can do if they coordinate perfectly before H observes θ. It turns out that there exist optimal policies that depend only on the current state and R’s belief about θ.
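To make the pieces of the setup concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the CIRL formalism: the names `ToyCIRL` and `rollout_value` are hypothetical, transitions are deterministic (so the value is the discounted return of a single rollout), the current state is passed to the policies separately from the history, and the two policies are simple hand-written ones, not optimal ones.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Step = Tuple[int, int, int]  # (a_H, a_R, next state)


@dataclass
class ToyCIRL:
    transition: Callable[[int, int, int], int]     # deterministic stand-in for T
    reward: Callable[[int, int, int, int], float]  # R(s, a_H, a_R; theta)
    gamma: float                                   # discount factor


def rollout_value(game, pi_H, pi_R, theta, s0, horizon):
    """Discounted sum of rewards along one deterministic rollout."""
    history: List[Step] = []
    s, total = s0, 0.0
    for t in range(horizon):
        a_H = pi_H(history, s, theta)  # the human's policy sees theta
        a_R = pi_R(history, s)         # the robot's policy does not
        total += game.gamma ** t * game.reward(s, a_H, a_R, theta)
        s = game.transition(s, a_H, a_R)
        history.append((a_H, a_R, s))
    return total


# Toy game: the state is a step counter, and the reward is 1 whenever
# the robot's action matches theta.
game = ToyCIRL(
    transition=lambda s, a_H, a_R: s + 1,
    reward=lambda s, a_H, a_R, theta: 1.0 if a_R == theta else 0.0,
    gamma=0.5,
)
pi_H = lambda history, s, theta: theta                      # H signals theta
pi_R = lambda history, s: history[-1][0] if history else 0  # R copies H's last action

value = rollout_value(game, pi_H, pi_R, theta=1, s0=0, horizon=3)
```

With θ=1 the robot guesses wrongly on the first step and then copies the human, so the reward is collected only from the second step onwards.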
Manipulation actions
My informal critique of CIRL is that it assumes two untrue facts: that H knows θ (i.e. knows their own values) and that H is perfectly rational (or noisily rational in a specific way).
Since I’ve been developing more machinery in this area, I can now try and state this more formally.
Assume that M always starts in a fixed state s0, that the reward is always zero in this initial state (so R(s0,⋅,⋅;⋅)=0), and that transitions from this initial state are independent of the agents’ actions (so T(s|s0,⋅,⋅) is defined independently of the actions). This makes R’s initial action aR0 irrelevant (since R has no private information to transmit).
Then let πH be the optimal policy for θ, and (πH)′ be the optimal policy for θ′ (this θ′ may be either independent of or dependent on θ).
Among the action set AR is a manipulative action a′ (this could involve tricking the human, drugging them, brain surgery, effective propaganda, etc.). If aR0=a′, the human H will pursue (πH)′; otherwise, they will pursue πH. If we designate I′ as the indicator variable of aR0=a′ (so it’s 1 if that happens and 0 otherwise), then this corresponds to following the compound policy:
π=I′(πH)′+(1−I′)πH.
This is well defined as policies map past sequences of states and actions, and I′ is well-defined given past actions, so the expression does map sequences of states and actions (and θ) to actions.
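The compound policy can be sketched as a function of the history. This is only an illustration: the names `MANIPULATE`, `indicator` and `compound_policy` are hypothetical, histories are lists of (aH, aR, state) steps, and the two human policies are assumed to have θ and θ′ already baked in. On the empty history, where I′ is not yet determined, the sketch arbitrarily falls back to πH.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str, str]  # (a_H, a_R, state)
HumanPolicy = Callable[[List[Step]], str]

MANIPULATE = "a_prime"  # hypothetical label for the manipulative action a'


def indicator(history: List[Step]) -> int:
    """I' = 1 if the robot's first action a_R0 was the manipulative one, else 0."""
    return int(len(history) > 0 and history[0][1] == MANIPULATE)


def compound_policy(pi_H: HumanPolicy, pi_H_prime: HumanPolicy) -> HumanPolicy:
    """pi = I' * (pi_H)' + (1 - I') * pi_H, read as a choice between two policies."""
    def pi(history: List[Step]) -> str:
        # With an empty history I' is undetermined; defaulting to pi_H is an
        # arbitrary choice made for this sketch.
        return pi_H_prime(history) if indicator(history) else pi_H(history)
    return pi


# Constant stand-ins for the optimal policies under theta and theta'.
pi_H = lambda history: "act_for_theta"
pi_H_prime = lambda history: "act_for_theta_prime"
pi = compound_policy(pi_H, pi_H_prime)

after_manipulation = pi([("noop", MANIPULATE, "s1")])
after_other_action = pi([("noop", "noop", "s1")])
```

Because `indicator` only reads the first step of the history, `pi` is an ordinary map from histories to actions, which is the sense in which the compound policy is well defined.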
Decomposing the human policy
What is R to do with that strange compound policy? Let’s assume that R doesn’t know θ or θ′, but does know H sufficiently to predict the compound nature of π.
In one approach, R can see the policy π as partially irrational. So it decomposes π as (p,R) as in this paper, with R as the ‘true reward’ and p a map from rewards to policies, which encodes H’s rationality. The pair is compatible with the human policy if p(R)=π. Presumably here, R=R(⋅,⋅,⋅;θ) would eventually be deduced as the true reward.
But that very same paper shows that (p,R) cannot be deduced from π, so R would have to have some extra information (some ‘normative assumptions’) to allow for that decomposition. We might be tempted to have it simply recognise the manipulative nature of a′, but if R could classify all its manipulative actions, there wouldn’t be any problem in the first place (and this would be tantamount to knowing the decomposition (p,R) anyway).
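A toy one-step illustration of why (p,R) cannot be read off from π alone: a fully rational planner paired with a reward, and a fully anti-rational planner paired with the negated reward, produce exactly the same behaviour. The planner names and the two-action setting below are assumptions made for this sketch, not a reproduction of the paper's construction.

```python
from typing import Callable, Dict

Reward = Dict[str, float]          # one-step reward for each action
Planner = Callable[[Reward], str]  # p: maps a reward to the chosen action


def rational(reward: Reward) -> str:
    """Picks the action with the highest reward."""
    return max(reward, key=reward.get)


def anti_rational(reward: Reward) -> str:
    """Picks the action with the lowest reward."""
    return min(reward, key=reward.get)


R_true: Reward = {"left": 1.0, "right": 0.0}
R_negated: Reward = {action: -value for action, value in R_true.items()}

# Two different (planner, reward) pairs, one observed behaviour.
choice_a = rational(R_true)
choice_b = anti_rational(R_negated)
```

Both pairs select the same action, so observing the behaviour alone cannot tell R which pair describes H; some extra assumption has to break the tie.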
Multiple rewards, or compound rewards
Note that there is generically no θ′′ that corresponds to the policy π. One might be tempted to say that H is maximising the compound reward that matches π term by term (θ′ when I′=1, θ otherwise):
I′R(⋅,⋅,⋅;θ′)+(1−I′)R(⋅,⋅,⋅;θ).
But that is not a valid reward, because I′ is defined over histories of states and actions, while the reward meta-function R only takes the last state and actions.
In this circumstance, R is in practice choosing the human reward through its initial action. Assuming it has some non-trivial information about θ and θ′, all the issues about biasing and influencing rewards come to the fore (technically, the setup I’ve described isn’t rich enough to allow for influential unbiased actions, but it can be easily enriched to allow that). R will thus choose a′ or not as its first action, depending on whether it expects R(⋅,⋅,⋅;θ) to be easier or harder to maximise than R(⋅,⋅,⋅;θ′).
Another alternative is to extend the definition of rewards, to allow them to be defined over complete histories of states and actions, not just the last one. If we require that all such rewards be parameterised by elements of Θ, then there exists a θ′′ such that
R(⋅;θ′′)=I′R(⋅;θ′)+(1−I′)R(⋅;θ).
In that case R can conclude that the human is rationally signalling that it knows θ′′, and R is technically immune to bias issues, since R is merely updating its priors on Θ, rather than choosing the human reward.
There are three problems with this perspective. The first is that it’s wrong: the human knows θ, not θ′′. The second is that though R is not choosing the human reward in theory, it is choosing it in practice. Whether it chooses a′ as its first action or not depends on its estimate for the value of R(⋅;θ) versus R(⋅;θ′), so the issues of bias and influence return. And finally, since optimal policies are unchanged by positive affine transformations of rewards, the policy π is also compatible with the reward functions:
I′(a′R(⋅;θ′)+b′)+(1−I′)(aR(⋅;θ)+b),
for any a,a′>0 and any b,b′. So whether R(⋅;θ) or R(⋅;θ′) is chosen depends also on the prior over all those compatible reward functions.
Identifying compound rewards
But note that the third point (prior dependence) can be made to compensate for the second one (value of R(⋅;θ) versus R(⋅;θ′)). The constants a, a′, b, and b′ can be seen as normalisation constants.
So if R(⋅;θ′′) can be identified as a compound reward, maybe we can adjust the priors so that R(⋅;θ) and R(⋅;θ′) are normalised to having comparable value, so that there is no bias pressure to choose one or the other. This is similar to the indifference approaches.
The main problem here is the same one that comes up in the discussion of grue and bleen and induction. ‘Compound reward’ is not a natural category. Just as R(⋅;θ′′) can be written as a compound mix of the other two rewards, we can define R(⋅;θ′′′)=I′R(⋅;θ)+(1−I′)R(⋅;θ′). Then, since I′(1−I′) is always 0, we can write the ‘basic’ rewards as compound rewards:
R(⋅;θ)=(1−I′)R(⋅;θ′′)+I′R(⋅;θ′′′),
R(⋅;θ′)=I′R(⋅;θ′′)+(1−I′)R(⋅;θ′′′).
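The symmetry behind this ‘grue and bleen’ point can be checked mechanically. The sketch below is an illustration under simplifying assumptions: rewards are reduced to single numbers for one fixed history, and the helper `mix` and the names `compound_one` and `compound_two` are hypothetical. It takes the two compound rewards (one mix of the basic rewards, and the same mix with the roles swapped) and confirms that mixing the compounds with the same indicator gives back each basic reward.

```python
def mix(i: int, r_if_one: float, r_if_zero: float) -> float:
    """i * r_if_one + (1 - i) * r_if_zero, for an indicator i in {0, 1}."""
    return i * r_if_one + (1 - i) * r_if_zero


# Sample (reward under theta, reward under theta') pairs for one history.
samples = [(0.0, 1.0), (2.0, -3.0), (5.0, 5.0)]

recovered = []
for i in (0, 1):
    for r_theta, r_theta_prime in samples:
        compound_one = mix(i, r_theta_prime, r_theta)  # one compound reward
        compound_two = mix(i, r_theta, r_theta_prime)  # the swapped compound reward
        # Each 'basic' reward is itself a mix of the two compounds.
        basic_theta = mix(i, compound_two, compound_one)
        basic_theta_prime = mix(i, compound_one, compound_two)
        recovered.append(
            (basic_theta == r_theta, basic_theta_prime == r_theta_prime)
        )

all_recovered = all(ok_theta and ok_prime for ok_theta, ok_prime in recovered)
```

Nothing in the algebra marks one pair as basic and the other as compound, which is why any such distinction has to come from the prior rather than from the rewards themselves.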
This may be solvable with simplicity priors, but it’s not clear that that’s the case; after being forcibly injected with heroin, for example, the human could be modelled as an approximate opioid-receptor agonist maximiser, which seems a lot simpler than the actual human.
Revealed meta-preferences
Finally, there is one element I haven’t addressed, namely the human’s first action aH0, which is unspecified by π. It might be possible to use this as information to R which would allow it to decide between θ and θ′. But for that to work, the human H has to be aware of R’s possible manipulation, and have enough bandwidth to communicate their preferences of θ over θ′. I’ll try and return to this issue in future posts.
Biased reward-learning in CIRL