I agree I should re-do these all with examples. But, to answer your questions as far as possible:
What A will do: ∅ forever (maybe with some extra info for the SA).
What A would counterfactually do (to define $V_k(s^{(t)}_{t+\tau})$): $\pi_0$ for $\tau$ turns (the $\pi_0$ rollout, to define $s^{(t)}_{t+\tau}$) followed by $\pi^*_R$ (to define the $V_k$ aspect). Call this $\pi^*_{0,\tau,k}$.
The penalty is defined in terms of these $V_k(s^{(t)}_{t+\tau})$, entirely counterfactual except for $\tau=0$.
What SA must do: iff A follows $\pi^*_{0,\tau,k}$, then SA must ensure that $V_k(s^{(t)}_{t+\tau}) = V_k(s'^{(t)}_{t+\tau})$.
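As a rough illustration of the counterfactual policy described above, here is a minimal Python sketch of "follow the baseline for τ turns, then switch to the reward-maximising policy". The names `make_counterfactual_policy`, `pi_0`, `pi_star_k`, and the action strings are all hypothetical stand-ins chosen for this example, and the toy policies ignore the state entirely.

```python
from typing import Callable, Hashable

State = Hashable
Action = str
# A policy maps (state, number of turns elapsed since time t) to an action.
Policy = Callable[[State, int], Action]


def make_counterfactual_policy(pi_0: Policy, pi_star: Policy, tau: int) -> Policy:
    """Follow pi_0 for the first `tau` turns, then follow pi_star."""
    def policy(state: State, step: int) -> Action:
        return pi_0(state, step) if step < tau else pi_star(state, step)
    return policy


# Toy stand-ins (assumptions for illustration only).
def pi_0(state: State, step: int) -> Action:
    return "noop"            # the baseline policy always does nothing


def pi_star_k(state: State, step: int) -> Action:
    return "maximise_R_k"    # placeholder for the reward-maximising policy


policy = make_counterfactual_policy(pi_0, pi_star_k, tau=3)
actions = [policy(None, step) for step in range(5)]
print(actions)
```

With `tau=3`, the first three actions come from the baseline and the rest from the maximising policy, which is the shape of behaviour the penalty terms are defined over.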
So a key question is: how early can SA detect whether A is following $\pi^*_{0,\tau,k}$? If it can detect this immediately, then it can immediately implement the policy for ensuring that $V_k(s^{(t)}_{t+\tau}) = V_k(s'^{(t)}_{t+\tau})$ (for the stricter impact measure) or the much easier $BV_k(s_t) = BV_k(s'_t)$ (for the original impact measure).
If it has to wait $\tau$ turns to be sure, then it needs to preserve its power to control many different $k$.
Now, A will never actually follow $\pi^*_{0,\tau,k}$, so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of SA will determine the penalty; SA is rewarded for quickly gaining control of the rewards.
In practice, A will signal early on that it’s not following $\pi^*_{0,\tau,k}$. Maybe there’s a very small impact action, $\emptyset_\epsilon$, with a tiny penalty. Then A would like to use that to signal to SA that it’s not in the counterfactual world where it is following $\pi^*_{0,\tau,k}$. So long as $\pi_0$’s first action isn’t $\emptyset_\epsilon$, that signal is enough (of course, if $\pi_0$’s first action isn’t $\emptyset$, then $\emptyset$ is a sufficient signal).
After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn’t in.
The one mild constraint remaining comes from $s^{(t)}_{t+0} = s_t$, which is not a counterfactually defined state (though $V_k(s_t)$ is a counterfactually defined property of that state).
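The signalling step can be sketched in the same toy style: SA compares A's observed actions with what the baseline policy would have done and notes the first mismatch. This is only an illustration under simplifying assumptions; `first_deviation` and the action strings are hypothetical, and it only checks deviation from the baseline rollout, not from the maximising policy that follows it.

```python
from typing import Optional, Sequence


def first_deviation(observed: Sequence[str], baseline: Sequence[str]) -> Optional[int]:
    """Return the first step at which A's action differs from the baseline's, or None."""
    for step, (seen, expected) in enumerate(zip(observed, baseline)):
        if seen != expected:
            return step
    return None


# Assumed toy baseline: pi_0 would take "noop" on every turn.
baseline_actions = ["noop", "noop", "noop"]

# A takes the tiny-impact action on the first turn as a signal.
signalled = first_deviation(["noop_eps", "noop", "noop"], baseline_actions)

# A matches the baseline so far, so SA cannot yet rule out the counterfactual.
unsignalled = first_deviation(["noop", "noop", "noop"], baseline_actions)

print(signalled, unsignalled)
```

A deviation at step 0 is the case discussed above, where SA learns immediately that it is not in the counterfactual world; a result of `None` corresponds to SA still having to hedge across many different k.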
I’ll try to write up a further clarification of this.