I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.
Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}[\gamma^n]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.
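To make that concrete, here is a minimal sketch of computing $R$ in a deterministic environment, where the optimal policy for reaching a goal is just a shortest path (the adjacency-list representation and function name are mine, for illustration):

```python
from collections import deque

def reachability(graph, s1, s2, gamma=0.95):
    """R(s1, s2) = gamma^n, where n is the length of the shortest path from
    s1 to s2. Assumes a deterministic environment given as an adjacency list
    mapping each state to its one-step successors."""
    if s1 == s2:
        return 1.0  # n = 0, so gamma^0
    seen, frontier = {s1}, deque([(s1, 0)])
    while frontier:
        state, n = frontier.popleft()
        for nxt in graph[state]:
            if nxt == s2:
                return gamma ** (n + 1)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, n + 1))
    return 0.0  # s2 is unreachable from s1, so gamma^n -> 0
```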
Then, for the new paper when using a baseline, the future task value $V^*_{\text{future}}(s, s')$ is:
$$\mathbb{E}_{g,\, \tau \sim \pi_g,\, \tau' \sim \pi'_g}\left[\gamma^{\max(n, n')}\right]$$
where $s'$ is the baseline state and $g$ is the future goal.
In a deterministic environment, this can be rewritten as:
$$
\begin{aligned}
V^*_{\text{future}}(s, s') &= \mathbb{E}_g\left[\gamma^{\max(n, n')}\right] \\
&= \mathbb{E}_g\left[\min(R(s, g), R(s', g))\right] \\
&= \mathbb{E}_g\left[R(s', g) - \max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - \mathbb{E}_g\left[\max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - d_{RR}(s, s')
\end{aligned}
$$
Here, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.
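As a quick sanity check on that derivation, the following sketch (reusing the hypothetical `reachability` function above, on a small deterministic chain of my own choosing) verifies that averaging $\gamma^{\max(n, n')}$ over uniformly random goals matches $\mathbb{E}_g[R(s', g)] - d_{RR}(s, s')$:

```python
import itertools

# A five-state chain 0 - 1 - 2 - 3 - 4; moves are allowed in both directions.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}
states, gamma = list(chain), 0.95

for s, s_prime in itertools.product(states, repeat=2):
    lhs, baseline_term, rr_penalty = 0.0, 0.0, 0.0
    for g in states:  # the goal g is uniformly distributed over states
        r_s = reachability(chain, s, g, gamma)
        r_sp = reachability(chain, s_prime, g, gamma)
        lhs += min(r_s, r_sp)               # gamma^max(n, n') for this goal
        baseline_term += r_sp               # contributes to E_g[R(s', g)]
        rr_penalty += max(r_sp - r_s, 0.0)  # contributes to d_RR(s, s')
    n = len(states)
    # V*_future(s, s') = E_g[R(s', g)] - d_RR(s, s')
    assert abs(lhs / n - (baseline_term / n - rr_penalty / n)) < 1e-12
```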
Note that the first term, $\mathbb{E}_g[R(s', g)]$, depends only on the baseline state $s'$, which itself depends only on the number of timesteps. So at any fixed timestep, the first term is a constant.
The optimal value function in the new paper is (page 3, and using my notation of $V^*_{\text{future}}$ instead of their $V^*_i$):
$$V^*(s_t) = \max_{a_t \in \mathcal{A}} \left[ r(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} p(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) + (1 - \gamma)\, \beta\, V^*_{\text{future}} \right]$$
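As a concrete illustration, here is a minimal value-iteration sketch for this recursion (the tabular array layout is an assumption of mine, and for simplicity $V^*_{\text{future}}$ is treated as a precomputed per-state vector for whatever baseline is in use):

```python
import numpy as np

def value_iteration_with_future_tasks(P, R, v_future, beta, gamma, iters=1000):
    """Sketch of value iteration for the Bellman equation above.

    P[a, s, s2] : transition probability p(s2 | s, a)
    R[s, a]     : task reward r(s_t, a_t)
    v_future[s] : V*_future(s, s'_t), precomputed for the chosen baseline and
                  treated here as a fixed per-state quantity for simplicity
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # Q[s, a] = r(s, a) + gamma * E[V(s_{t+1})] + (1 - gamma) * beta * V*_future(s)
        Q = R + gamma * np.einsum("asn,n->sa", P, V) + (1 - gamma) * beta * v_future[:, None]
        V = Q.max(axis=1)
    return V
```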
The equation above is the regular Bellman equation, but with the following augmented reward (here $s'_t$ is the baseline state at time $t$):
Terminal states:
$$
\begin{aligned}
r_{\text{new}}(s_t) &= r(s_t) + \beta\, V^*_{\text{future}}(s_t, s'_t) \\
&= r(s_t) - \beta\, d_{RR}(s_t, s'_t) + \beta\, \mathbb{E}_g[R(s'_t, g)]
\end{aligned}
$$
Non-terminal states:
$$
\begin{aligned}
r_{\text{new}}(s_t, a_t) &= r(s_t, a_t) + (1 - \gamma)\, \beta\, V^*_{\text{future}}(s_t, s'_t) \\
&= r(s_t) - (1 - \gamma)\, \beta\, d_{RR}(s_t, s'_t) + (1 - \gamma)\, \beta\, \mathbb{E}_g[R(s'_t, g)]
\end{aligned}
$$
For comparison, the original relative reachability reward is:
$$r_{RR}(s_t, a_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t)$$
The $r(s_t)$ and $d_{RR}$ terms in $r_{\text{new}}$ are very similar to the two terms in $r_{RR}$; the $\mathbb{E}_g[R(s'_t, g)]$ term in $r_{\text{new}}$ depends only on the baseline.
All of these rewards so far are for finite-horizon MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1 - \gamma)$ (to account for the fact that the agent now gets that reward infinitely often, rather than just once as in the original MDP). Also define $k = \beta(1 - \gamma)$ for convenience. Then, we have:
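Here is a minimal sketch of that conversion for a tabular MDP (the dictionary representation and names are mine, for illustration, and deterministic dynamics are assumed):

```python
def to_infinite_horizon(next_state, reward, terminal_states, gamma):
    """Convert a finite-horizon tabular MDP to an infinite-horizon one, as
    described above: each terminal state gets a self-loop, and its reward is
    scaled by (1 - gamma) so that the discounted sum
    (1 - gamma) * r * (1 + gamma + gamma^2 + ...) = r
    matches the one-time reward in the original MDP.

    next_state[s][a] -> successor state (deterministic dynamics assumed)
    reward[s]        -> reward for being in state s
    """
    new_next = {s: dict(actions) for s, actions in next_state.items()}
    new_reward = dict(reward)
    for s in terminal_states:
        new_next[s] = {"stay": s}                 # add the self-loop
        new_reward[s] = (1 - gamma) * reward[s]   # scale the terminal reward
    return new_next, new_reward
```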
Non-terminal states:
$$
\begin{aligned}
r_{\text{new}}(s_t, a_t) &= r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)] \\
r_{RR}(s_t, a_t) &= r(s_t) - \beta\, d_{RR}(s_t, s'_t)
\end{aligned}
$$
What used to be terminal states that are now self-loop states:
$$
\begin{aligned}
r_{\text{new}}(s_t, a_t) &= (1 - \gamma)\, r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)] \\
r_{RR}(s_t, a_t) &= (1 - \gamma)\, r(s_t) - k\, d_{RR}(s_t, s'_t)
\end{aligned}
$$
Note that all of the transformations I’ve done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We’re ready for analysis. There are exactly two differences between relative reachability and future state rewards:
First, the future state rewards have an extra term, $k\, \mathbb{E}_g[R(s'_t, g)]$.
This term depends only on the baseline $s'_t$. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.
For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off the belt, so that in future baseline states it is still possible to reach goals $g$ that involve sushi.
Second, in non-terminal states, relative reachability weights the penalty by $\beta$ instead of $k = \beta(1 - \gamma)$. Since $\beta$ (and thus $k$) is an arbitrary hyperparameter, the real difference is that in relative reachability the weight on the penalty switches from $\beta$ in non-terminal states to the smaller $\beta(1 - \gamma)$ in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight drops sooner. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)
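To see this concretely, suppose (as a stylized assumption) that the penalty $d_{RR}(s_t, s'_t) = d$ is the same constant at every step, and the agent reaches a terminal / self-loop state at time $T$. Then the total discounted relative reachability penalty is
$$\sum_{t=0}^{T-1} \gamma^t \beta d + \sum_{t=T}^{\infty} \gamma^t \beta (1 - \gamma) d = \beta d\, \frac{1 - \gamma^T}{1 - \gamma} + \beta d\, \gamma^T = \beta d\, \frac{1 - \gamma^{T+1}}{1 - \gamma},$$
which increases with $T$, so ending the episode earlier reduces the total penalty. Under $r_{\text{new}}$ the penalty weight is $k$ in both kinds of states, so the same calculation gives $k d \sum_{t=0}^{\infty} \gamma^t = \beta d$, independent of $T$.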
Summary: The actual effect of the new paper's framing is that it 1. removes the "extra" incentive to finish the task quickly that relative reachability provided, and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.
(That said, it starts from a very different place than the original RR paper, so it’s interesting that they somewhat converge here.)