Okay, let’s annotate each A action with the policy that’s being followed/reward that’s being maximized. (And remember that lying is illegal.)
Iff agent A follows π_0, preserve A’s ability to maximise R.
Then A would be bound to follow π_0 to preserve its ability to maximize R, no? Assuming that to compute s′ from s, we follow π_0 instead of the last action.
The standard definition of the penalty uses ∅ as the action used to compute s′. If the same policy is used both to compute s′ and for the rollouts (which is ideal), then the stepwise inaction/π_0 baseline becomes an inaction/π_0 baseline (so section 3 is relevant here).
Section 2.2 is relevant when different policies are used to compute s′ and to roll out from s. In that situation, the subagent can ultimately end up with no restrictions at all (basically, this section shows why you want to use the same policy for both purposes).
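For reference, a rough sketch of where the two policy choices enter a penalty of this kind (the names π_base and π_roll, the weights w_i, and the auxiliary value functions V_{R_i} are my notation, not the thread's):

Pen(a_{t−1}) = Σ_i w_i · | V_{R_i}^{π_roll}(s_t) − V_{R_i}^{π_roll}(s′_t) |, where s_t is reached from s_{t−1} via a_{t−1}, and s′_t via π_base(s_{t−1}).

The standard stepwise inaction baseline takes π_base = ∅; the question above amounts to taking π_base = π_0. As I read the two paragraphs above, the cases split on whether π_base and π_roll coincide.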
It's only equal to the inaction baseline on the first step; the stepwise baseline always has its step of divergence at the last step.
Note that the stepwise π_0 baseline suggests using different baselines per auxiliary reward, namely the action that maximizes that auxiliary reward. Or, equivalently, using the stepwise inaction baseline where the effect of inaction is that no time passes.
I'll also remind here that it looks like, instead of merely maximizing the auxiliary reward as a baseline, we ought to also apply an impact penalty when computing the baseline.
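One way to read the 'equivalently' step, assuming V_{R_i} and Q_{R_i} are optimal value functions for the auxiliary reward R_i (again my notation): if the baseline action for R_i is argmax_a Q_{R_i}(s_{t−1}, a), then

V_{R_i}(s′_t(R_i)) ≈ max_a Q_{R_i}(s_{t−1}, a) = V_{R_i}(s_{t−1}),

so, up to one step of reward and discounting, comparing against the R_i-maximizing baseline is the same as comparing against s′_t = s_{t−1}, i.e. a stepwise 'inaction' in which no time passes.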
I’m not following you here. Could you put this into equations/examples?
Here are three sentences that might illuminate their respective paragraphs. If they don't, ask again.
The stepwise inaction baseline with inaction rollouts already uses the same policy for s′_t and rollouts, and yet it is not the inaction baseline.
Why not set s′_t = s_{t−1}?
Why not subtract ωD from every R (in a fixpointy way)?
In this case, it is [the inaction baseline], because the agent A will only do ∅ from then on, to zero out the subsequent penalties.
It messes up the comparison for rewards that fluctuate based on time; it doesn't block subagent creation… and I've never seen it before, so I don't know what it could do ^_^ Do you have a well-developed version of this?
The last point I don’t understand at all.
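To make the 'rewards that fluctuate based on time' objection concrete, a toy example of my own (not from the thread): let an auxiliary reward pay 1 on even timesteps and 0 on odd ones, with discount γ. Then even if A only ever takes ∅,

| V_R(s_t) − V_R(s_{t−1}) | = | 1/(1−γ²) − γ/(1−γ²) | = 1/(1+γ) > 0,

so the s′_t = s_{t−1} baseline assigns a nonzero penalty purely because time advanced, not because A changed anything.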