Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to $R$ on $V_R$ and $Q_R$:
$$R_{AUP}(s, a) = R(s) - \lambda \frac{\max(V^*(T(s, a)) - V^*(T(s, \varnothing)),\ 0)}{V^*(s) - Q^*(s, \varnothing)}$$
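Spelled out as code, the equation is just the following (a transcription against a hypothetical `R` / `T` / `V_star` / `Q_star` interface of my own, not anything from the post):

```python
def R_AUP(s, a, R, T, V_star, Q_star, lam, noop):
    """Scaled AUP reward: the primary reward minus a scaled penalty for
    ending up (under R) better off than after one step of inaction.

    R, T, V_star, Q_star are assumed to be handles to the primary reward,
    the deterministic transition function, and the optimal value and
    action-value functions for R. Note that the scale in the denominator
    is zero exactly when the no-op is optimal at s.
    """
    penalty = max(V_star(T(s, a)) - V_star(T(s, noop)), 0.0)
    scale = V_star(s) - Q_star(s, noop)
    return R(s) - lam * penalty / scale
```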
Consider some starting state $s_0$, some starting action $a_0$, and consider the optimal trajectory under $R$ that starts with that, which we'll denote as $s_0 a_0 s_1 a_1 s_2 \dots$. Define $s'_i = T(s_{i-1}, \varnothing)$ to be the one-step inaction states. Assume that $Q^*(s_0, a_0) > Q^*(s_0, \varnothing)$. Since all the other actions $a_1, a_2, \dots$ are optimal for $R$, we have $V^*(s_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) \geq \frac{1}{\gamma}(Q^*(s_{i-1}, \varnothing) - R(s_{i-1})) = V^*(s'_i)$, so the max in the equation above goes away, and the total $R_{AUP}$ obtained is:
$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{V^*(s_i) - V^*(s'_i)}{V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing)}\right)$$
Since we're considering the optimal trajectory, we have $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = [R(s_{i-1}) + \gamma V^*(s_i)] - [R(s_{i-1}) + \gamma V^*(s'_i)] = \gamma (V^*(s_i) - V^*(s'_i))$.
Substituting this back in, we get that the total $R_{AUP}$ for the optimal trajectory is
$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{1}{\gamma}\right)$$
which… uh… diverges to negative infinity, as long as $\gamma < 1$. (Technically I've assumed that $V^*(s_i) - V^*(s'_i)$ is nonzero, which is an assumption that there is always an action that is better than $\varnothing$.)
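As a quick numerical sanity check of the key identity (not of the divergence claim itself), here is a sketch on a hypothetical deterministic chain MDP of my own devising, where $R(s) = s$, moving right is always optimal, and the no-op drifts back toward state 0, so every chosen action strictly beats inaction. Along the optimal trajectory, each scaled penalty term comes out to $1/\gamma$, as derived above.

```python
# A toy deterministic chain (not the post's setting): states 0..N-1,
# R(s) = s, "right" moves one state up the chain, the no-op drifts one
# state down. We check that along the optimal (always-right) trajectory
# the scaled penalty term
#   (V*(s_i) - V*(s'_i)) / (V*(s_{i-1}) - Q*(s_{i-1}, noop))
# equals 1/gamma.

N, GAMMA = 12, 0.9
RIGHT, NOOP = 0, 1

def T(s, a):
    """Deterministic transition function."""
    return min(s + 1, N - 1) if a == RIGHT else max(s - 1, 0)

def R(s):
    """State-based primary reward."""
    return float(s)

V = [0.0] * N
for _ in range(2000):  # value iteration for V*
    V = [max(R(s) + GAMMA * V[T(s, a)] for a in (RIGHT, NOOP)) for s in range(N)]
Q = lambda s, a: R(s) + GAMMA * V[T(s, a)]

s_prev = 0
for i in range(1, 8):
    s_i = T(s_prev, RIGHT)        # next state on the optimal trajectory
    s_inaction = T(s_prev, NOOP)  # the one-step inaction state s'_i
    term = (V[s_i] - V[s_inaction]) / (V[s_prev] - Q(s_prev, NOOP))
    print(f"step {i}: scaled penalty term = {term:.4f}  (1/gamma = {1 / GAMMA:.4f})")
    s_prev = s_i
```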
So, you must prefer the always-$\varnothing$ trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn't fall into a trap where $\varnothing$ is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird: surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.
----
Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?
Let's consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that $\varnothing$ is guaranteed to be a noop: for any state $s$, we have $T(s, \varnothing) = s$.
Now, for any trajectory $s_0 a_0 \dots$ with $s'_i$ defined as before, we have $V^*(s'_i) = V^*(s_{i-1})$, so $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = V^*(s_{i-1}) - [R(s_{i-1}) + \gamma V^*(s_{i-1})] = (1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})$.
As a check, in the case where $a_{i-1}$ is optimal, we have $V^*(s_i) - V^*(s'_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) - V^*(s_{i-1}) = \frac{1}{\gamma}\left((1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})\right)$.
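Both identities are easy to confirm numerically; here is a sketch on a hypothetical toy chain of my own (the names and the environment are not from the post), this time with the no-op as a true self-loop.

```python
# A toy deterministic chain with a self-loop no-op: T(s, noop) = s,
# R(s) = s, and "right" moves one state toward N-1 (and is optimal at
# every non-terminal state). We check the two identities derived above.

N, GAMMA = 12, 0.9
RIGHT, NOOP = 0, 1
T = lambda s, a: min(s + 1, N - 1) if a == RIGHT else s
R = lambda s: float(s)

V = [0.0] * N
for _ in range(2000):  # value iteration for V*
    V = [max(R(s) + GAMMA * V[T(s, a)] for a in (RIGHT, NOOP)) for s in range(N)]
Q = lambda s, a: R(s) + GAMMA * V[T(s, a)]

for s in range(N - 1):  # "right" is optimal at every non-terminal state
    # Identity that holds for any state when the no-op is a self-loop:
    assert abs((V[s] - Q(s, NOOP)) - ((1 - GAMMA) * V[s] - R(s))) < 1e-6
    # Identity for the case where the chosen action is optimal:
    lhs = V[T(s, RIGHT)] - V[T(s, NOOP)]
    rhs = ((1 - GAMMA) * V[s] - R(s)) / GAMMA
    assert abs(lhs - rhs) < 1e-6
print("both identities hold on this toy chain")
```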
Plugging this into the original equation recovers the divergence to negative infinity that we saw before.
But let’s assume that we just do a constant scaling to avoid this divergence:
$$R_{AUP}(s, a) = R(s) - \lambda \max(V^*(T(s, a)) - V^*(T(s, \varnothing)),\ 0)$$
Then for an arbitrary trajectory (assuming that the chosen actions are no worse than $\varnothing$), we get $R_{AUP}(s_i, a_i) = R(s_i) - \lambda (V^*(s_{i+1}) - V^*(s_i)) = R(s_i) - \lambda V^*(s_{i+1}) + \lambda V^*(s_i)$.
The total reward across the trajectory is then
$$\left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda \left(\sum_{i=1}^{\infty} \gamma^{i-1} V^*(s_i)\right) + \lambda \left(\sum_{i=0}^{\infty} \gamma^i V^*(s_i)\right)$$
$$= \left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) + \lambda V^*(s_0) - \lambda \sum_{i=1}^{\infty} \gamma^{i-1} (1 - \gamma) V^*(s_i)$$
The $\lambda V^*(s_0)$ and $R(s_0)$ terms are constants and so don't matter for selecting policies, so I'm going to throw them out:
$$= \sum_{i=1}^{\infty} \gamma^i \left[R(s_i) - \frac{\lambda (1 - \gamma)}{\gamma} V^*(s_i)\right]$$
So in deterministic environments with state-based rewards where $\varnothing$ is a true noop (even the environment doesn't evolve), AUP with constant scaling is equivalent to adding a penalty $\text{Penalty}(s) = k V^*(s)$ for some constant $k$; that is, we're effectively penalizing the agent for reaching good states, in direct proportion to how good they are (according to $R$). Again, this seems much more like satisficing or quantilization than impact / power measurement.
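Here is a sketch that checks this equivalence numerically, again on a hypothetical toy chain with a self-loop no-op (the environment, names, and horizon are mine, not the post's). Under the derivation above the constant works out to $k = \lambda(1-\gamma)/\gamma$, and the discounted return of the constant-scaled $R_{AUP}$ should match $\sum_{i \geq 1} \gamma^i [R(s_i) - k V^*(s_i)]$ plus a policy-independent constant.

```python
# Toy deterministic chain with a self-loop no-op: T(s, noop) = s, R(s) = s.
# We compare the discounted return of the constant-scaled R_AUP along an
# arbitrary trajectory against  R(s_0) + lam * V*(s_0)
#                               + sum_{i>=1} gamma^i [R(s_i) - k V*(s_i)]
# with k = lam * (1 - gamma) / gamma.

N, GAMMA, LAM = 12, 0.9, 0.7
RIGHT, NOOP = 0, 1
T = lambda s, a: min(s + 1, N - 1) if a == RIGHT else s
R = lambda s: float(s)

V = [0.0] * N
for _ in range(2000):  # value iteration for V*
    V = [max(R(s) + GAMMA * V[T(s, a)] for a in (RIGHT, NOOP)) for s in range(N)]

def R_AUP(s, a):  # constant-scaled AUP reward
    return R(s) - LAM * max(V[T(s, a)] - V[T(s, NOOP)], 0.0)

# Any policy here is "no worse than the no-op", since V* increases along the chain.
policy = lambda i: RIGHT if i % 3 else NOOP  # an arbitrary trajectory from s_0 = 0
k = LAM * (1 - GAMMA) / GAMMA

H = 400  # horizon long enough that the truncated tails are negligible
s, lhs, rhs = 0, 0.0, 0.0
for i in range(H):
    a = policy(i)
    lhs += GAMMA**i * R_AUP(s, a)
    if i >= 1:
        rhs += GAMMA**i * (R(s) - k * V[s])
    s = T(s, a)
rhs += R(0) + LAM * V[0]  # the policy-independent constant R(s_0) + lam * V*(s_0)

assert abs(lhs - rhs) < 1e-6
print(f"total R_AUP return = {lhs:.6f}, penalty form = {rhs:.6f}")
```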