Planned summary for the Alignment Newsletter:

In a fixed, stationary environment, we would like our agents to be time-consistent: that is, they should not have a positive incentive to restrict their future choices. However, impact measures like <@AUP@>(@Towards a New Impact Measure@) calculate impact by looking at what the agent could have done otherwise. As a result, the agent has an incentive to change what this counterfactual is in order to reduce the penalty it receives, and it can accomplish this by restricting its future choices. This is demonstrated concretely with a gridworld example.
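The toy sketch below (not the post's actual gridworld; the states, numbers, and helper names are all illustrative assumptions) shows the mechanism: an AUP-style penalty compares the attainable auxiliary value after an action to the value after a no-op, so once the agent has already thrown away its future options, both branches look alike and the penalty for the otherwise-costly action disappears.

```python
# Minimal sketch of an AUP-style penalty: |value attainable after acting -
# value attainable after a no-op|. All situations and numbers are made up.

AUX_VALUE = {                     # attainable auxiliary value by situation
    ("start", "full"): 10,        # nothing done yet, all options open
    ("moved", "full"): 2,         # the move destroyed most options
    ("start", "restricted"): 2,   # options already thrown away earlier
    ("moved", "restricted"): 2,
}

def penalty(action: str, choices: str) -> int:
    """Impact penalty for `action` given the current action set (`choices`)."""
    after_act = ("moved", choices) if action == "move" else ("start", choices)
    after_noop = ("start", choices)
    return abs(AUX_VALUE[after_act] - AUX_VALUE[after_noop])

print(penalty("move", "full"))        # 8: moving is heavily penalized
print(penalty("move", "restricted"))  # 0: the same move is free once the
                                      #    counterfactual has been narrowed
```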
Planned opinion:
It’s worth noting that measures like AUP do create a Markovian reward function, which typically leads to time-consistent agents. The reason this doesn’t apply here is that we’re assuming the restriction of future choices is “external” to the environment and formalism, but nonetheless affects the penalty. If we instead model this restriction “inside” the environment, then we need to include a state variable specifying whether the action set is restricted. In that case, the impact measure creates a reward function that depends on that state variable. So another way of stating the problem is: if you add the ability to restrict future actions to the environment, the impact penalty leads to a reward function that depends on whether the action set is restricted, which intuitively we don’t want. (This point is also made in this follow-up post.)
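As a rough sketch of this framing (with made-up numbers matching the sketch above, not the post's formalism), once "is the action set restricted?" is folded into the Markov state, the penalized reward the agent sees for the very same action depends on that flag:

```python
# Illustrative only: the AUP-shaped reward for "move" once the restriction
# flag is part of the state. Penalty and task-reward values are assumptions.

PENALTY = {"full": 8, "restricted": 0}   # impact penalty for "move"
TASK_REWARD = 5                          # task reward for "move"

def shaped_reward(restricted: bool) -> int:
    """Task reward minus impact penalty, as a function of the state flag."""
    return TASK_REWARD - PENALTY["restricted" if restricted else "full"]

print(shaped_reward(False))  # -3: moving looks bad while choices are open
print(shaped_reward(True))   #  5: the same move looks good once they're gone
```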
Good, cheers!