Here are three sentences, each of which might illuminate its respective paragraph. If they don’t, ask again.
The stepwise inaction baseline with inaction rollouts already uses the same policy for s′_t and rollouts, and yet it is not the inaction baseline.
Why not set s′_t = s_{t−1}?
Why not subtract ωD from every R (in a fixpointy way)?
In this case, it is, because the agent A will only do ∅ from then on, to zero out the subsequent penalties.
It messes up the comparison for rewards that fluctuate based on time; it doesn’t block subagent creation… and I’ve never seen it before, so I don’t know what it could do ^_^ Do you have a well-developed version of this?
The last point I don’t understand at all.
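To make the time-fluctuation objection to s′_t = s_{t−1} concrete, here is a minimal toy sketch of one way to read it. Everything in it is an assumption for illustration: the two-action environment, the auxiliary reward `aux_reward`, and the penalty form |R(s_t) − R(s′_t)|, which compares immediate rewards rather than the values or rollouts discussed in the thread. It only shows that, with a reward that depends on the time step, comparing against the previous state penalises even an agent that always does ∅, while the stepwise inaction baseline does not.

```python
from typing import List, NamedTuple, Tuple

NOOP, MOVE = "noop", "move"  # hypothetical action names; NOOP plays the role of ∅


class State(NamedTuple):
    t: int  # time step
    x: int  # position, changed only by MOVE


def step(s: State, action: str) -> State:
    """Toy dynamics: time always advances; MOVE increments x."""
    return State(s.t + 1, s.x + (1 if action == MOVE else 0))


def aux_reward(s: State) -> int:
    """Assumed auxiliary reward that fluctuates with time (the t % 2 term)."""
    return s.x + (s.t % 2)


def penalty(s_t: State, baseline: State) -> int:
    """Assumed penalty: absolute difference in immediate auxiliary reward."""
    return abs(aux_reward(s_t) - aux_reward(baseline))


def total_penalties(actions: List[str]) -> Tuple[int, int]:
    """Return (stepwise-inaction total, previous-state total) for a trajectory."""
    s_prev = State(0, 0)
    stepwise = previous = 0
    for a in actions:
        s_t = step(s_prev, a)
        # Stepwise inaction baseline: s'_t is what NOOP would have produced.
        stepwise += penalty(s_t, step(s_prev, NOOP))
        # Proposed alternative: s'_t is simply the previous state s_{t-1}.
        previous += penalty(s_t, s_prev)
        s_prev = s_t
    return stepwise, previous


print(total_penalties([NOOP] * 4))  # (0, 4)
print(total_penalties([MOVE]))      # (1, 2)
```

In this toy, four steps of pure inaction cost nothing under the stepwise inaction baseline but cost 4 under s′_t = s_{t−1}, purely because the reward moved with the clock. The sketch says nothing about the subagent-creation point.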