The stepwise inaction baseline with inaction rollouts already uses the same policy for s′t and rollouts, and yet it is not the inaction baseline.
In this case, it is, because the agent A will only do ∅ from then on, to zero out the subsequent penalties.
Why not set s′t=st−1?
It messes up the comparison for rewards that fluctuate based on time, it doesn’t block subagent creation… and I’ve never seen it before, so I don’t know what it could do ^_^ Do you have a well-developed version of this?
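To make the time-fluctuation point concrete, here is a minimal toy sketch, not anyone's actual implementation. Everything in it is an assumption made for illustration: the `(time, position)` state, the `step` dynamics, the `time_varying_reward` function, and the choice of penalty as the absolute difference of a single auxiliary reward. It compares the stepwise inaction baseline (s′t is what you get by doing ∅ from st−1) with the proposed s′t=st−1.

```python
from typing import Tuple

# Hypothetical toy state: (time step, position).
State = Tuple[int, int]

NOOP = 0  # stands in for the null action ∅


def step(state: State, action: int) -> State:
    """Toy deterministic dynamics: time always advances, position shifts by `action`."""
    t, x = state
    return (t + 1, x + action)


def time_varying_reward(state: State) -> float:
    """Hypothetical auxiliary reward that fluctuates with time: pays only on even steps."""
    t, x = state
    return float(x + 1) if t % 2 == 0 else 0.0


def penalty(state: State, baseline: State) -> float:
    """Assumed penalty form: absolute difference in auxiliary reward."""
    return abs(time_varying_reward(state) - time_varying_reward(baseline))


def stepwise_inaction_penalty(prev: State, action: int) -> float:
    # Baseline s'_t is the state reached by doing ∅ at s_{t-1}.
    return penalty(step(prev, action), step(prev, NOOP))


def previous_state_penalty(prev: State, action: int) -> float:
    # Baseline s'_t = s_{t-1}: compare against the previous state itself.
    return penalty(step(prev, action), prev)


if __name__ == "__main__":
    prev = (0, 2)
    print(stepwise_inaction_penalty(prev, NOOP))
    print(previous_state_penalty(prev, NOOP))
```

In this toy setup, starting from `(0, 2)` and doing ∅, the stepwise inaction penalty is 0.0, while the s′t=st−1 version gives 3.0: the agent gets penalised purely because the reward changed with the clock, even though it did nothing.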
The last point I don’t understand at all.