Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to $R$ on $V_R$ and $Q_R$:
$$R_{AUP}(s, a) = R(s) - \lambda \frac{\max(V^*(T(s, a)) - V^*(T(s, \varnothing)),\ 0)}{V^*(s) - Q^*(s, \varnothing)}$$
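Spelled out as code, the equation is just the following (a transcription against a hypothetical `R` / `T` / `V_star` / `Q_star` interface of my own, not anything from the post):

```python
def R_AUP(s, a, R, T, V_star, Q_star, lam, noop):
    """Scaled AUP reward: the primary reward minus a scaled penalty for
    ending up (under R) better off than after one step of inaction.

    R, T, V_star, Q_star are assumed to be handles to the primary reward,
    the deterministic transition function, and the optimal value and
    action-value functions for R. Note that the scale in the denominator
    is zero exactly when the no-op is optimal at s.
    """
    penalty = max(V_star(T(s, a)) - V_star(T(s, noop)), 0.0)
    scale = V_star(s) - Q_star(s, noop)
    return R(s) - lam * penalty / scale
```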
Consider some starting state $s_0$, some starting action $a_0$, and consider the optimal trajectory under $R$ that starts with that, which we'll denote as $s_0 a_0 s_1 a_1 s_2 \dots$. Define $s'_i = T(s_{i-1}, \varnothing)$ to be the one-step inaction states. Assume that $Q^*(s_0, a_0) > Q^*(s_0, \varnothing)$. Since all the other actions $a_1, a_2, \dots$ are optimal for $R$, we have $V^*(s_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) \geq \frac{1}{\gamma}(Q^*(s_{i-1}, \varnothing) - R(s_{i-1})) = V^*(s'_i)$, so the max in the equation above goes away, and the total $R_{AUP}$ obtained is:
$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{V^*(s_i) - V^*(s'_i)}{V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing)}\right)$$
Since we're considering the optimal trajectory, we have $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = [R(s_{i-1}) + \gamma V^*(s_i)] - [R(s_{i-1}) + \gamma V^*(s'_i)] = \gamma (V^*(s_i) - V^*(s'_i))$.
Substituting this back in, we get that the total $R_{AUP}$ for the optimal trajectory is
$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{1}{\gamma}\right)$$
which… uh… diverges to negative infinity, as long as $\gamma < 1$. (Technically I've assumed that $V^*(s_i) - V^*(s'_i)$ is nonzero, which is an assumption that there is always an action that is better than $\varnothing$.)
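As a quick numerical sanity check of the key identity (not of the divergence claim itself), here is a sketch on a hypothetical deterministic chain MDP of my own devising, where $R(s) = s$, moving right is always optimal, and the no-op drifts back toward state 0, so every chosen action strictly beats inaction. Along the optimal trajectory, each scaled penalty term comes out to $1/\gamma$, as derived above.

```python
# A toy deterministic chain (not the post's setting): states 0..N-1,
# R(s) = s, "right" moves one state up the chain, the no-op drifts one
# state down. We check that along the optimal (always-right) trajectory
# the scaled penalty term
#   (V*(s_i) - V*(s'_i)) / (V*(s_{i-1}) - Q*(s_{i-1}, noop))
# equals 1/gamma.

N, GAMMA = 12, 0.9
RIGHT, NOOP = 0, 1

def T(s, a):
    """Deterministic transition function."""
    return min(s + 1, N - 1) if a == RIGHT else max(s - 1, 0)

def R(s):
    """State-based primary reward."""
    return float(s)

V = [0.0] * N
for _ in range(2000):  # value iteration for V*
    V = [max(R(s) + GAMMA * V[T(s, a)] for a in (RIGHT, NOOP)) for s in range(N)]
Q = lambda s, a: R(s) + GAMMA * V[T(s, a)]

s_prev = 0
for i in range(1, 8):
    s_i = T(s_prev, RIGHT)        # next state on the optimal trajectory
    s_inaction = T(s_prev, NOOP)  # the one-step inaction state s'_i
    term = (V[s_i] - V[s_inaction]) / (V[s_prev] - Q(s_prev, NOOP))
    print(f"step {i}: scaled penalty term = {term:.4f}  (1/gamma = {1 / GAMMA:.4f})")
    s_prev = s_i
```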
So, you must prefer the always-$\varnothing$ trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn't fall into a trap where $\varnothing$ is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird: surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.
----
Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?
Let's consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that $\varnothing$ is guaranteed to be a noop: for any state $s$, we have $T(s, \varnothing) = s$.
Now, for any trajectory $s_0 a_0 \dots$ with $s'_i$ defined as before, we have $V^*(s'_i) = V^*(s_{i-1})$, so $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = V^*(s_{i-1}) - [R(s_{i-1}) + \gamma V^*(s_{i-1})] = (1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})$.
As a check, in the case where $a_{i-1}$ is optimal, we have $V^*(s_i) - V^*(s'_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) - V^*(s_{i-1}) = \frac{1}{\gamma}\left((1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})\right)$.
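Both identities are easy to confirm numerically; here is a sketch on a hypothetical toy chain of my own (the names and the environment are not from the post), this time with the no-op as a true self-loop.

```python
# A toy deterministic chain with a self-loop no-op: T(s, noop) = s,
# R(s) = s, and "right" moves one state toward N-1 (and is optimal at
# every non-terminal state). We check the two identities derived above.

N, GAMMA = 12, 0.9
RIGHT, NOOP = 0, 1
T = lambda s, a: min(s + 1, N - 1) if a == RIGHT else s
R = lambda s: float(s)

V = [0.0] * N
for _ in range(2000):  # value iteration for V*
    V = [max(R(s) + GAMMA * V[T(s, a)] for a in (RIGHT, NOOP)) for s in range(N)]
Q = lambda s, a: R(s) + GAMMA * V[T(s, a)]

for s in range(N - 1):  # "right" is optimal at every non-terminal state
    # Identity that holds for any state when the no-op is a self-loop:
    assert abs((V[s] - Q(s, NOOP)) - ((1 - GAMMA) * V[s] - R(s))) < 1e-6
    # Identity for the case where the chosen action is optimal:
    lhs = V[T(s, RIGHT)] - V[T(s, NOOP)]
    rhs = ((1 - GAMMA) * V[s] - R(s)) / GAMMA
    assert abs(lhs - rhs) < 1e-6
print("both identities hold on this toy chain")
```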
Plugging this into the original equation recovers the divergence to negative infinity that we saw before.
But let’s assume that we just do a constant scaling to avoid this divergence:
$$R_{AUP}(s, a) = R(s) - \lambda \max(V^*(T(s, a)) - V^*(T(s, \varnothing)),\ 0)$$
Then for an arbitrary trajectory (assuming that the chosen actions are no worse than $\varnothing$), we get $R_{AUP}(s_i, a_i) = R(s_i) - \lambda (V^*(s_{i+1}) - V^*(s_i)) = R(s_i) - \lambda V^*(s_{i+1}) + \lambda V^*(s_i)$.
The total reward across the trajectory is then
$$\left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda \left(\sum_{i=1}^{\infty} \gamma^{i-1} V^*(s_i)\right) + \lambda \left(\sum_{i=0}^{\infty} \gamma^i V^*(s_i)\right)$$
$$= \left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) + \lambda V^*(s_0) - \lambda \sum_{i=1}^{\infty} \gamma^{i-1} (1 - \gamma) V^*(s_i)$$
The $\lambda V^*(s_0)$ and $R(s_0)$ terms are constants and so don't matter for selecting policies, so I'm going to throw them out:
$$= \sum_{i=1}^{\infty} \gamma^i \left[R(s_i) - \frac{\lambda (1 - \gamma)}{\gamma} V^*(s_i)\right]$$
So in deterministic environments with state-based rewards where $\varnothing$ is a true noop (even the environment doesn't evolve), AUP with constant scaling is equivalent to adding a penalty $\text{Penalty}(s) = k V^*(s)$ for some constant $k$; that is, we're effectively penalizing the agent for reaching good states, in direct proportion to how good they are (according to $R$). Again, this seems much more like satisficing or quantilization than impact / power measurement.
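Here is a sketch that checks this equivalence numerically, again on a hypothetical toy chain with a self-loop no-op (the environment, names, and horizon are mine, not the post's). Under the derivation above the constant works out to $k = \lambda(1-\gamma)/\gamma$, and the discounted return of the constant-scaled $R_{AUP}$ should match $\sum_{i \geq 1} \gamma^i [R(s_i) - k V^*(s_i)]$ plus a policy-independent constant.

```python
# Toy deterministic chain with a self-loop no-op: T(s, noop) = s, R(s) = s.
# We compare the discounted return of the constant-scaled R_AUP along an
# arbitrary trajectory against  R(s_0) + lam * V*(s_0)
#                               + sum_{i>=1} gamma^i [R(s_i) - k V*(s_i)]
# with k = lam * (1 - gamma) / gamma.

N, GAMMA, LAM = 12, 0.9, 0.7
RIGHT, NOOP = 0, 1
T = lambda s, a: min(s + 1, N - 1) if a == RIGHT else s
R = lambda s: float(s)

V = [0.0] * N
for _ in range(2000):  # value iteration for V*
    V = [max(R(s) + GAMMA * V[T(s, a)] for a in (RIGHT, NOOP)) for s in range(N)]

def R_AUP(s, a):  # constant-scaled AUP reward
    return R(s) - LAM * max(V[T(s, a)] - V[T(s, NOOP)], 0.0)

# Any policy here is "no worse than the no-op", since V* increases along the chain.
policy = lambda i: RIGHT if i % 3 else NOOP  # an arbitrary trajectory from s_0 = 0
k = LAM * (1 - GAMMA) / GAMMA

H = 400  # horizon long enough that the truncated tails are negligible
s, lhs, rhs = 0, 0.0, 0.0
for i in range(H):
    a = policy(i)
    lhs += GAMMA**i * R_AUP(s, a)
    if i >= 1:
        rhs += GAMMA**i * (R(s) - k * V[s])
    s = T(s, a)
rhs += R(0) + LAM * V[0]  # the policy-independent constant R(s_0) + lam * V*(s_0)

assert abs(lhs - rhs) < 1e-6
print(f"total R_AUP return = {lhs:.6f}, penalty form = {rhs:.6f}")
```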