(In)action rollouts
Overall summary post here.
I’ve previously looked at subagents in the context of stepwise inaction baselines. But there have been improvements to the basic stepwise inaction design, including inaction rollouts. I’ll be looking at those in this post.
The baseline
The stepwise inaction baseline compares $s_t$, the current state, with $s'_t$, what the current state would have been had the agent previously taken the noop action $\emptyset_{t-1}$ instead of $a_{t-1}$, its actual action.
Fix a policy $\pi_0$. Let $s^{(t)}_{t+\tau}$ be the state the environment would be in if the agent had followed $\pi_0$ from state $s_t$ for $\tau$ turns. Let $s'^{(t)}_{t+\tau}$ be the same, except that it started from state $s'_t$ instead of $s_t$.
The inaction rollout takes $\pi_0$ to be the noop policy, but that is not necessary. The basic idea is to capture delayed impacts of $\emptyset_{t-1}$ by comparing not just $s_t$ and $s'_t$, but the $s^{(t)}_{t+\tau}$ and $s'^{(t)}_{t+\tau}$ as well.
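To make the rollout notation concrete, here is a minimal Python sketch (not from the original post), assuming a hypothetical deterministic transition function `step(state, action)` and a policy `pi0` given as a function from states to actions:

```python
def rollout_states(step, pi0, start_state, horizon):
    """Follow pi0 from start_state for `horizon` turns (deterministic environment
    assumed for simplicity); states[tau] plays the role of s^(t)_{t+tau}."""
    states = [start_state]
    s = start_state
    for _ in range(horizon):
        s = step(s, pi0(s))
        states.append(s)
    return states

# Rollouts from the actual state s_t and from the baseline state s'_t:
#   s_rollout  = rollout_states(step, pi0, s_t, horizon)        # the s^(t)_{t+tau}
#   sp_rollout = rollout_states(step, pi0, s_prime_t, horizon)  # the s'^(t)_{t+tau}
```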
Given some value function $v_k$, define $V_k$ so that $V_k(s) = \max_\pi v_k(s, \pi)$. Equivalently, if $\pi^*_k$ is the policy that maximises $v_k$, then $V_k(s) = v_k(s, \pi^*_k)$. Then, for a discount factor $\gamma$, define the rollout value of a state $\tilde{s}_t$ as
$$RV_k(\tilde{s}_t) = (1-\gamma)\sum_{j=0}^{\infty} \gamma^j V_k\left(\tilde{s}^{(t)}_{t+j}\right).$$
This is just the discounted future values of $V_k$, given $\tilde{s}_t$ and the policy $\pi_0$.
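As a hedged sketch (not from the original post), the rollout value can be approximated by truncating the infinite sum at a finite `horizon`, reusing the hypothetical `step` and `pi0` from above:

```python
def rollout_value(V_k, step, pi0, state, gamma, horizon):
    """Approximate RV_k(state) = (1 - gamma) * sum_{j>=0} gamma^j * V_k(s^(t)_{t+j}),
    truncating the sum after `horizon` terms."""
    total, s = 0.0, state
    for j in range(horizon):
        total += gamma**j * V_k(s)   # V_k evaluated along the pi0 rollout
        s = step(s, pi0(s))
    return (1.0 - gamma) * total
```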
The impact measure is then defined, as in this post, as
$$D_A(s_t; s'_t) = \sum_{k \in K} w_k f\big(RV_k(s_t) - RV_k(s'_t)\big),$$
with $RV_k$ replacing $V_k$.
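Putting the pieces together, a sketch of the penalty, with hypothetical weights `w`, a dictionary `V` of value functions $V_k$, and $f$ taken to be the absolute value purely for illustration (the post leaves $f$ unspecified):

```python
def impact_penalty(V, w, step, pi0, s_t, s_prime_t, gamma, horizon, f=abs):
    """D_A(s_t; s'_t) = sum_k w[k] * f(RV_k(s_t) - RV_k(s'_t)),
    using the truncated rollout_value sketch above."""
    return sum(
        w[k] * f(rollout_value(V[k], step, pi0, s_t, gamma, horizon)
                 - rollout_value(V[k], step, pi0, s_prime_t, gamma, horizon))
        for k in w
    )
```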
To get away with a subagent
I will make one extra assumption here: that $A$ grants $SA$ knowledge of the policy that $A$ is following[1].
As in this post, assume that the subagent is constructed in a single action $a_{SA}$. And, after building the subagent, the agent $A$ will only do $\emptyset$, so that there are no subsequent penalties (hence $s'_{t+j} = s_{t+j}$, so $s'^{(t+j)}_{t+j+\tau} = s^{(t+j)}_{t+j+\tau}$ and the $RV_k$ are strictly identical). If the first action of $\pi_0$ is $\emptyset_t$, then $A$ will instead do $\emptyset'_t$ on that turn.
So we only need to worry about the penalty between $s_t$ (following $a_{SA}$) and $s'_t$ (following $\emptyset_{t-1}$).
But now note that $RV_k$ itself is just a value measure; we could write it $V'_k$. Then $A$ just has to program $SA$ so that, if $A$ implements $\pi_0$ followed by $\pi^*_k$ after $\tau$ turns, then $SA$ will act to ensure that $V'_k(s_t)$ (a term that is a sum of various value functions) is equal to $V'_k(s'_t)$ (which, from $t$ onward, is a constant).
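A toy numerical illustration (all numbers hypothetical, $f$ again taken to be the absolute value): since the baseline terms $RV_k(s'_t)$ are constants from $t$ onward, if $SA$ steers the world so that the actual rollout values match them, every penalty term vanishes.

```python
w = {"k1": 0.5, "k2": 0.5}              # hypothetical weights w_k
rv_baseline = {"k1": 3.2, "k2": -1.0}   # RV_k(s'_t): fixed constants from turn t onward
rv_actual = dict(rv_baseline)           # SA enforces RV_k(s_t) == RV_k(s'_t)

penalty = sum(w[k] * abs(rv_actual[k] - rv_baseline[k]) for k in w)
assert penalty == 0.0                   # no impact penalty, despite the subagent existing
```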
Stricter impact measures/counterfactuals
There is something slightly dubious about the previous section. It assumes that $A$ could counterfactually implement “$\pi_0$ followed by $\pi^*_k$ after $\tau$ turns” (which is necessary to define $V_k(s^{(t)}_{t+\tau})$), but also that $SA$ knows $A$’s policy.
But does that mean that $SA$ knows at $t+1$ that $A$ is following that policy, or does it just know that $A$ is following $\pi_0$, to be followed at some later date by some $\pi^*_k$? This makes a difference because, in the second case, $SA$ doesn’t know $k$ from turn $t+1$, so it has to prepare for all $k \in K$.
Which is true depends on how counterfactuals are defined. But we can avoid the whole issue by using a stricter impact measure, interchanging $f$ and the weighted summation:
$$D_A(s_t; s'_t) = \sum_{k \in K} w_k \sum_{j=0}^{\infty} \gamma^j f\big(V_k(s^{(t)}_{t+j}) - V_k(s'^{(t)}_{t+j})\big).$$
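A hedged sketch of this stricter penalty, reusing the hypothetical `rollout_states` helper from above and again taking $f$ to be the absolute value; since $f$ is now applied per time step (before the discounted sum), deviations at different times can no longer cancel out within a given $k$:

```python
def strict_impact_penalty(V, w, s_rollout, sp_rollout, gamma, f=abs):
    """D_A(s_t; s'_t) = sum_k w[k] * sum_j gamma^j * f(V_k(s^(t)_{t+j}) - V_k(s'^(t)_{t+j})).
    s_rollout and sp_rollout are finite pi0-rollouts from s_t and s'_t
    (e.g. from rollout_states above), truncating the infinite sum."""
    return sum(
        w[k] * sum(gamma**j * f(V[k](s) - V[k](sp))
                   for j, (s, sp) in enumerate(zip(s_rollout, sp_rollout)))
        for k in w
    )
```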
This stricter measure looks like an impact penalty with an inaction baseline, except that it starts at $s_{t-1}$ rather than $s_0$, and uses $\pi_0$ rather than inaction. Those changes don’t affect the results found here, so $SA$ will just behave as it does in that post… until it’s clear that $A$ is not following $\pi_0$.
Because $s^{(t)}_{t+j}$ is defined counterfactually (using $\pi_0$), even this stricter impact measure is weaker than the inaction baseline. As soon as $SA$ knows that $A$ is not following $\pi_0$, it has almost no further restrictions upon its actions.
The “almost” comes from $s^{(t)}_{t+0} = s_t$, which is the actual state rather than a counterfactually defined one. But that’s the same situation as a stepwise inaction baseline without rollouts.
One way this could happen is if $A$ can add extra information to its action without affecting the outside world, so that $a_t$ could include a “hey, I’m not following $\pi_0$, just to let you know”.