Stuart_Armstrong comments on Subagents and impact measures, full and fully illustrated

Stuart_Armstrong 26 Feb 2020 11:32 UTC
2 points

The stepwise inaction baseline with inaction rollouts already uses the same policy for $s_{t}^{'}$ and rollouts, and yet it is not the inaction baseline.

In this case, it is, because the agent $A$ will only do $\emptyset$ from then on, to zero out the subsequent penalties.

Why not set $s_{t}^{'} = s_{t - 1}$ ?

It messes up the comparison for rewards that fluctuate based on time; it doesn’t block subagent creation… and I’ve never seen it before, so I don’t know what it could do ^_^ Do you have a well-developed version of this?
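
A toy illustration of the time-fluctuation problem (the reward and both penalty functions are made up for this example and are not taken from the post; the penalty is simplified to an absolute difference in immediate reward rather than in value). An agent that does nothing at all still gets penalised at every step under $s_{t}^{'} = s_{t - 1}$, because the comparison is between two different times:

```python
def reward(t):
    # Hypothetical auxiliary reward that depends only on the time step:
    # 1 on even steps, 0 on odd steps. The agent cannot influence it.
    return 1.0 if t % 2 == 0 else 0.0


def penalty_previous_state_baseline(t):
    # Baseline s'_t = s_{t-1}: the state at time t is compared
    # with the state one step earlier, i.e. at time t - 1.
    return abs(reward(t) - reward(t - 1))


def penalty_stepwise_inaction_baseline(t):
    # Baseline s'_t = the state reached from s_{t-1} after one no-op.
    # That baseline state is also at time t, so an agent that
    # did nothing matches it exactly.
    return abs(reward(t) - reward(t))


# The agent takes the no-op at every step in both cases.
penalties_prev = [penalty_previous_state_baseline(t) for t in range(1, 6)]
penalties_step = [penalty_stepwise_inaction_baseline(t) for t in range(1, 6)]

print(penalties_prev)
print(penalties_step)
```

In this sketch the previous-state baseline charges a penalty on every step purely because the clock moved, while the stepwise inaction baseline charges nothing.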

The last point I don’t understand at all.