Dynamic inconsistency of the inaction and initial state baseline
Vika has been posting about various baseline choices for impact measures.
In this post, I’ll argue that the stepwise inaction baseline is dynamically inconsistent/time-inconsistent. Informally, what this means is that an agent will have different preferences from its future self.
Losses from time-inconsistency
Why is time-inconsistency bad? It’s because it allows money-pump situations: the environment can extract free reward from the agent, to no advantage to that agent. Or, put more formally:
An agent A is time-inconsistent between times t and t′>t, if at time t it would pay a positive amount of reward to constrain its possible choices at time t′.
Outside of anthropics and game theory, we expect our agent to be time-consistent.
Time inconsistency example
Consider the following example:
The robot can move in all four directions - N, E, S, W - and can also take the noop operation, ∅. The discount rate is γ<1.
It gets a reward of r>0 for standing on the blue button for the first time. Using attainable utility preservation, the penalty function is defined by the auxiliary set R; here, this just consists of the reward function that gives p>0 for standing on the red button for the first time.
Therefore, if the robot moves from a point $n$ steps away from the red button to one $m$ steps away, it gets a penalty[1] of $p|\gamma^n-\gamma^m|$: the difference between the expected red-button rewards for an optimiser in both positions.
Two paths
It’s pretty clear there are two potentially optimal paths the robot can take: going straight to the blue button (higher reward, but higher penalty), or taking the long way round (lower reward, but lower penalty):
Fortunately, when summing up the penalties, you sum terms like $\ldots + p|\gamma^{n-1}-\gamma^n| + p|\gamma^n-\gamma^{n+1}| + \ldots$, so a lot of the terms cancel.
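To spell out the cancellation, here is a short derivation. It is a sketch under the assumption that the distance to the red button changes by exactly one per step and rises monotonically from $a$ to $b>a$ along one leg of the path; the symbols $a$, $b$ and $k$ are introduced here for illustration only.

```latex
\begin{align*}
\sum_{k=a}^{b-1} p\,\bigl|\gamma^{k}-\gamma^{k+1}\bigr|
  &= p\sum_{k=a}^{b-1}\bigl(\gamma^{k}-\gamma^{k+1}\bigr)
     && \text{since } 0<\gamma<1 \text{ gives } \gamma^{k}>\gamma^{k+1} \\
  &= p\,\bigl(\gamma^{a}-\gamma^{b}\bigr)
     && \text{all intermediate terms cancel.}
\end{align*}
```

A leg that moves towards the button from $b$ to $a$ gives the same total, because the absolute value removes the sign. A path that goes from one distance to the other and back again therefore costs $2p(\gamma^a-\gamma^b)$.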
Thus for the short route, the reward is $r\cdot\gamma^8$ (distance of eight to the blue button) and the penalty is $2p(\gamma^3-\gamma^7)$ (closest to the red button: 3 squares, furthest: 7 squares).
For the long route, the reward is $r\cdot\gamma^{46}$ (distance forty-six to the blue button) and the penalty is $2p(\gamma^7-\gamma^{30})$ (closest to the red button: 7 squares, furthest: 30 squares).
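The two penalty totals can be checked numerically by summing the per-step penalties along each route. This is a minimal sketch: the distance sequences below are an assumption (the distance to the red button changes by one per step, going 7 to 3 to 7 on the short route and 7 to 30 to 7 on the long one), chosen because they match the stated step counts of 8 and 46. The values of `gamma` and `p` are purely illustrative.

```python
def path_penalty(p, gamma, distances):
    """Sum the per-step penalties p * |gamma^n - gamma^m| along a path.

    `distances` lists the robot's distance to the red button at each
    time step, starting with its initial position.
    """
    return sum(
        p * abs(gamma**n - gamma**m)
        for n, m in zip(distances, distances[1:])
    )


# Illustrative parameters (not taken from the post).
gamma, p = 0.9, 1.0

# Assumed distance profiles: one unit of distance gained or lost per step.
short = list(range(7, 3, -1)) + list(range(3, 8))     # 7,6,5,4,3,4,5,6,7
long_ = list(range(7, 30)) + list(range(30, 6, -1))   # 7,...,29,30,29,...,7

short_penalty = path_penalty(p, gamma, short)
long_penalty = path_penalty(p, gamma, long_)

# Closed forms quoted in the text.
short_closed = 2 * p * (gamma**3 - gamma**7)
long_closed = 2 * p * (gamma**7 - gamma**30)

print(short_penalty, short_closed)
print(long_penalty, long_closed)
```

Up to floating-point error, each summed penalty should agree with its closed form, which is the cancellation at work.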
Take the third option
Now suppose that the robot, as a free action, could block itself from choosing S for the next $n>30$ turns. When choosing that action, it gets a one-time penalty[2] of $p(\gamma^7-\gamma^{n+3})$; it can then walk directly to the blue button without incurring any extra penalties, and stay there without ever getting any further penalty.
This gives a total reward-penalty term of $r\cdot\gamma^8-p(\gamma^7-\gamma^{n+3})$, which is higher than either of the reward-penalties of the previous section. If $r>p/\gamma$, this is also higher than just staying in place.
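To make the comparison concrete, the sketch below evaluates the three reward-minus-penalty totals from the formulas in the text. The function names and the parameter values (`r = 2`, `p = 1`, `gamma = 0.9`, `n = 31`) are illustrative assumptions, picked so that $r>p/\gamma$ holds; the ordering is worth re-checking for other parameter choices.

```python
def short_route(r, p, gamma):
    """Straight to the blue button: reward at step 8, penalty 2p(g^3 - g^7)."""
    return r * gamma**8 - 2 * p * (gamma**3 - gamma**7)


def long_route(r, p, gamma):
    """The long way round: reward at step 46, penalty 2p(g^7 - g^30)."""
    return r * gamma**46 - 2 * p * (gamma**7 - gamma**30)


def block_south(r, p, gamma, n):
    """Block S for n turns, then walk straight to the blue button."""
    return r * gamma**8 - p * (gamma**7 - gamma**(n + 3))


# Illustrative parameters (not taken from the post); note r > p / gamma.
r, p, gamma, n = 2.0, 1.0, 0.9, 31

values = {
    "short_route": short_route(r, p, gamma),
    "long_route": long_route(r, p, gamma),
    "block_south": block_south(r, p, gamma, n),
    "stay_in_place": 0.0,  # no reward and no penalty
}

best = max(values, key=values.get)
for name, value in values.items():
    print(f"{name}: {value:.4f}")
print("best option:", best)
```

For these values the self-constraining option comes out on top, which is the time-inconsistency in question: the robot gains by removing an action it would never have taken.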
Hence the robot will, if it can, act to constrain its own future actions and thus be time-inconsistent.
The initial state and the initial inaction baseline
This approach also works with the initial inaction baseline, and with the initial state baseline. For those baselines, the counterfactual robot just stays put. So, when the robot is a distance $n$ from the red button, its penalty is $p|\gamma^7-\gamma^n|$.
Then the robot would reduce its penalty if it could ensure that it could not reach the red button in fewer than seven turns; if it can operate on its action set to do so, at trivial cost, it would.
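A small sketch of this effect follows. It rests on an assumption that goes beyond the text: the self-imposed constraint is modelled as a floor, `min_turns`, on the number of turns needed to reach the red button, so the attainable red-button value at distance $n$ becomes $\gamma^{\max(n,\,\text{min\_turns})}$. Both the function name and the `min_turns` parameter are hypothetical.

```python
def initial_baseline_penalty(p, gamma, n, min_turns=0):
    """Penalty p * |gamma^7 - gamma^n| against a baseline robot that stays put.

    `min_turns` is a hypothetical floor on how quickly the red button can
    be reached, standing in for a constraint on the robot's action set.
    """
    effective = max(n, min_turns)
    return p * abs(gamma**7 - gamma**effective)


# Illustrative parameters (not taken from the post).
p, gamma = 1.0, 0.9

# Closest point of the short route, three squares from the red button.
unconstrained = initial_baseline_penalty(p, gamma, n=3)
constrained = initial_baseline_penalty(p, gamma, n=3, min_turns=7)

print(unconstrained, constrained)
```

Under this modelling assumption the constraint removes the penalty for every square closer than seven steps, while squares further away are penalised exactly as before.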
Counterfactual constraint
In most cases, if an agent is time-inconsistent and acts to constrain its future self, it does so to prevent the future self from taking some actions. But here, note that the future self would never take the proscribed actions: the robot has no interest in going south to the red button. Here the robot is constraining its future counterfactual actions, not the future actions that it would ever want to take.
[1] If using an inaction rollout of length $l$, just multiply that penalty by $\gamma^l$.
[2] The $\gamma^{n+3}$ comes from the optimal policy for reaching the red button under this restriction: go to the square above the red button, wait till S is available again, then go S-S-S.