Vika comments on Stepwise inaction and non-indexical impact measures

Vika 18 Feb 2020 12:24 UTC
LW: 4 AF: 2
I don’t think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don’t need to tell apart other states that are not the starting state.
I agree that we need to compare to the penalty if the subagent is not created; I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).
As you mention in your other comment, creating a subagent is effectively switching from a stepwise inaction baseline to an inaction baseline for the rest of the episode. This can be beneficial for the agent because of the ‘winding road’ problem, where the stepwise baseline with inaction rollouts can repeatedly penalize actions (e.g. turning the wheel to stay on the road and avoid crashing) that are not penalized by the inaction baseline. This is a general issue with inaction rollouts that needs to be fixed.
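To make the two baselines concrete, here is a minimal sketch of which states each one compares at every step of a toy winding road. Everything in it is an assumption made for illustration: the `State` class, the `step` dynamics (a turn at every position, so a no-op crashes the car where it stands), and the rollout length `k`. It deliberately leaves out the deviation measure, so it only shows what gets compared, not whether a given pair is penalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    pos: int               # how far along the road the car is
    crashed: bool = False  # crashing is absorbing


def step(state, action):
    """Hypothetical dynamics: the road turns at every position."""
    if state.crashed:
        return state                        # nothing changes after a crash
    if action == "steer":
        return State(state.pos + 1, False)  # take the turn, stay on the road
    return State(state.pos, True)           # noop: miss the turn, crash here


def rollout(state, k):
    """Inaction rollout: apply k no-ops starting from `state`."""
    for _ in range(k):
        state = step(state, "noop")
    return state


def comparison_states(start, actions, k):
    """Per step: (agent rollout, stepwise-baseline rollout, inaction baseline)."""
    rows = []
    prev = start
    for t, action in enumerate(actions, start=1):
        current = step(prev, action)   # state the agent actually reaches
        stepwise = step(prev, "noop")  # stepwise baseline: noop from previous state
        inaction = rollout(start, t)   # inaction baseline: noops since the start
        rows.append((rollout(current, k), rollout(stepwise, k), inaction))
        prev = current
    return rows


rows = comparison_states(State(0), ["steer"] * 3, k=2)
for agent_rollout, stepwise_rollout, inaction_state in rows:
    print(agent_rollout, stepwise_rollout, inaction_state)
```

In this toy, the agent's rollout and the stepwise baseline's rollout end in different crash sites at every step, so any deviation measure that is sensitive to that difference has something new to charge for each time the agent steers. The inaction baseline, by contrast, stays fixed at the crash at position 0 for the whole episode.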
- Stuart_Armstrong 18 Feb 2020 14:59 UTC
  LW: 2 AF: 1
  Alas, the inaction rollouts don’t seem to fix the problem: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/z9MfmF8gA7SBxGSmb