I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.
Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}[\gamma^n]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.
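To make that concrete, here is a minimal sketch of computing $R$ in a deterministic environment, where the optimal policy for reaching a goal is just a shortest path (the adjacency-list representation and function name are mine, for illustration):

```python
from collections import deque

def reachability(graph, s1, s2, gamma=0.95):
    """R(s1, s2) = gamma^n, where n is the length of the shortest path from
    s1 to s2. Assumes a deterministic environment given as an adjacency list
    mapping each state to its one-step successors."""
    if s1 == s2:
        return 1.0  # n = 0, so gamma^0
    seen, frontier = {s1}, deque([(s1, 0)])
    while frontier:
        state, n = frontier.popleft()
        for nxt in graph[state]:
            if nxt == s2:
                return gamma ** (n + 1)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, n + 1))
    return 0.0  # s2 is unreachable from s1, so gamma^n -> 0
```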
Then, for the new paper when using a baseline, the future task value $V^*_{\text{future}}(s, s')$ is:
$$\mathbb{E}_{g,\, \tau \sim \pi_g,\, \tau' \sim \pi'_g}\left[\gamma^{\max(n, n')}\right]$$
where $s'$ is the baseline state and $g$ is the future goal.
In a deterministic environment, this can be rewritten as:
$$
\begin{aligned}
V^*_{\text{future}}(s, s') &= \mathbb{E}_g\left[\gamma^{\max(n, n')}\right] \\
&= \mathbb{E}_g\left[\min(R(s, g), R(s', g))\right] \\
&= \mathbb{E}_g\left[R(s', g) - \max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - \mathbb{E}_g\left[\max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - d_{RR}(s, s')
\end{aligned}
$$
Here, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.
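As a quick sanity check on that derivation, the following sketch (reusing the hypothetical `reachability` function above, on a small deterministic chain of my own choosing) verifies that averaging $\gamma^{\max(n, n')}$ over uniformly random goals matches $\mathbb{E}_g[R(s', g)] - d_{RR}(s, s')$:

```python
import itertools

# A five-state chain 0 - 1 - 2 - 3 - 4; moves are allowed in both directions.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}
states, gamma = list(chain), 0.95

for s, s_prime in itertools.product(states, repeat=2):
    lhs, baseline_term, rr_penalty = 0.0, 0.0, 0.0
    for g in states:  # the goal g is uniformly distributed over states
        r_s = reachability(chain, s, g, gamma)
        r_sp = reachability(chain, s_prime, g, gamma)
        lhs += min(r_s, r_sp)               # gamma^max(n, n') for this goal
        baseline_term += r_sp               # contributes to E_g[R(s', g)]
        rr_penalty += max(r_sp - r_s, 0.0)  # contributes to d_RR(s, s')
    n = len(states)
    # V*_future(s, s') = E_g[R(s', g)] - d_RR(s, s')
    assert abs(lhs / n - (baseline_term / n - rr_penalty / n)) < 1e-12
```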
Note that the first term, $\mathbb{E}_g[R(s', g)]$, depends only on the baseline state $s'$, which itself depends only on the number of timesteps. So at any fixed timestep, the first term is a constant.
The optimal value function in the new paper is (page 3, and using my notation of $V^*_{\text{future}}$ instead of their $V^*_i$):
$$V^*(s_t) = \max_{a_t \in \mathcal{A}} \left[ r(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} p(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) + (1 - \gamma)\, \beta\, V^*_{\text{future}} \right]$$
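As a concrete illustration, here is a minimal value-iteration sketch for this recursion (the tabular array layout is an assumption of mine, and for simplicity $V^*_{\text{future}}$ is treated as a precomputed per-state vector for whatever baseline is in use):

```python
import numpy as np

def value_iteration_with_future_tasks(P, R, v_future, beta, gamma, iters=1000):
    """Sketch of value iteration for the Bellman equation above.

    P[a, s, s2] : transition probability p(s2 | s, a)
    R[s, a]     : task reward r(s_t, a_t)
    v_future[s] : V*_future(s, s'_t), precomputed for the chosen baseline and
                  treated here as a fixed per-state quantity for simplicity
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # Q[s, a] = r(s, a) + gamma * E[V(s_{t+1})] + (1 - gamma) * beta * V*_future(s)
        Q = R + gamma * np.einsum("asn,n->sa", P, V) + (1 - gamma) * beta * v_future[:, None]
        V = Q.max(axis=1)
    return V
```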
The equation above is the regular Bellman equation, but with the following augmented reward (here $s'_t$ is the baseline state at time $t$):
Terminal states:
$$
\begin{aligned}
r_{\text{new}}(s_t) &= r(s_t) + \beta\, V^*_{\text{future}}(s_t, s'_t) \\
&= r(s_t) - \beta\, d_{RR}(s_t, s'_t) + \beta\, \mathbb{E}_g[R(s'_t, g)]
\end{aligned}
$$
Non-terminal states:
$$
\begin{aligned}
r_{\text{new}}(s_t, a_t) &= r(s_t, a_t) + (1 - \gamma)\, \beta\, V^*_{\text{future}}(s_t, s'_t) \\
&= r(s_t) - (1 - \gamma)\, \beta\, d_{RR}(s_t, s'_t) + (1 - \gamma)\, \beta\, \mathbb{E}_g[R(s'_t, g)]
\end{aligned}
$$
For comparison, the original relative reachability reward is:
$$r_{RR}(s_t, a_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t)$$
The $r(s_t)$ and $d_{RR}$ terms in $r_{\text{new}}$ are very similar to the two terms in $r_{RR}$; the $\mathbb{E}_g[R(s'_t, g)]$ term in $r_{\text{new}}$ depends only on the baseline.
All of these rewards so far are for finite-horizon MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1 - \gamma)$ (to account for the fact that the agent now gets that reward infinitely often, rather than just once as in the original MDP). Also define $k = \beta(1 - \gamma)$ for convenience. Then, we have:
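Here is a minimal sketch of that conversion for a tabular MDP (the dictionary representation and names are mine, for illustration, and deterministic dynamics are assumed):

```python
def to_infinite_horizon(next_state, reward, terminal_states, gamma):
    """Convert a finite-horizon tabular MDP to an infinite-horizon one, as
    described above: each terminal state gets a self-loop, and its reward is
    scaled by (1 - gamma) so that the discounted sum
    (1 - gamma) * r * (1 + gamma + gamma^2 + ...) = r
    matches the one-time reward in the original MDP.

    next_state[s][a] -> successor state (deterministic dynamics assumed)
    reward[s]        -> reward for being in state s
    """
    new_next = {s: dict(actions) for s, actions in next_state.items()}
    new_reward = dict(reward)
    for s in terminal_states:
        new_next[s] = {"stay": s}                 # add the self-loop
        new_reward[s] = (1 - gamma) * reward[s]   # scale the terminal reward
    return new_next, new_reward
```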
Non-terminal states:
$$
\begin{aligned}
r_{\text{new}}(s_t, a_t) &= r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)] \\
r_{RR}(s_t, a_t) &= r(s_t) - \beta\, d_{RR}(s_t, s'_t)
\end{aligned}
$$
What used to be terminal states that are now self-loop states:
$$
\begin{aligned}
r_{\text{new}}(s_t, a_t) &= (1 - \gamma)\, r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)] \\
r_{RR}(s_t, a_t) &= (1 - \gamma)\, r(s_t) - k\, d_{RR}(s_t, s'_t)
\end{aligned}
$$
Note that all of the transformations I’ve done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We’re ready for analysis. There are exactly two differences between relative reachability and future state rewards:
First, the future state rewards have an extra term, $k\, \mathbb{E}_g[R(s'_t, g)]$.
This term depends only on the baseline $s'_t$. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.
For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off the belt, so that in future baseline states it is still possible to reach goals $g$ that involve sushi.
Second, in non-terminal states, relative reachability weights the penalty by $\beta$ instead of $k = \beta(1 - \gamma)$. Since $\beta$ (and thus $k$) is an arbitrary hyperparameter, the real difference is that in relative reachability the weight on the penalty switches from $\beta$ in non-terminal states to the smaller $\beta(1 - \gamma)$ in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight drops sooner. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)
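To see this concretely, suppose (as a stylized assumption) that the penalty $d_{RR}(s_t, s'_t) = d$ is the same constant at every step, and the agent reaches a terminal / self-loop state at time $T$. Then the total discounted relative reachability penalty is
$$\sum_{t=0}^{T-1} \gamma^t \beta d + \sum_{t=T}^{\infty} \gamma^t \beta (1 - \gamma) d = \beta d\, \frac{1 - \gamma^T}{1 - \gamma} + \beta d\, \gamma^T = \beta d\, \frac{1 - \gamma^{T+1}}{1 - \gamma},$$
which increases with $T$, so ending the episode earlier reduces the total penalty. Under $r_{\text{new}}$ the penalty weight is $k$ in both kinds of states, so the same calculation gives $k d \sum_{t=0}^{\infty} \gamma^t = \beta d$, independent of $T$.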
Summary: The actual effect of the new paper's framing is that it 1. removes the "extra" incentive to finish the task quickly that relative reachability provided, and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.
(That said, it starts from a very different place than the original RR paper, so it’s interesting that they somewhat converge here.)