Stuart_Armstrong comments on (In)action rollouts

Stuart_Armstrong 18 Feb 2020 17:36 UTC
I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What $A$ will do: $\emptyset$ forever (perhaps with some extra information passed to the subagent $SA$).

What $A$ would counterfactually do (to define $V_{k} (s_{t + τ}^{(t)})$ ): $π_{0}$ for $τ$ turns (the $π_{0}$ rollout, to define $s_{t + τ}^{(t)}$ ) followed by $π_{R_{k}}^{*}$ (to define the $V_{k}$ aspect). Call this $π_{0, τ, k}^{*}$.
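As a rough illustration of that composite policy, here is a minimal Python sketch. The names `make_rollout_policy`, `pi_0`, `pi_star_k` and the toy action strings are all hypothetical and chosen only for this example. It assumes policies are plain functions from a state to an action, and that turns are counted from the start of the rollout at time $t$.

```python
from typing import Callable, Hashable

State = Hashable
Action = str
Policy = Callable[[State], Action]


def make_rollout_policy(
    pi_0: Policy, pi_star_k: Policy, tau: int
) -> Callable[[int, State], Action]:
    """Build the composite policy: pi_0 for the first tau turns, then pi_star_k.

    `turns_since_t` counts turns elapsed since the rollout started at time t.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")

    def policy(turns_since_t: int, state: State) -> Action:
        if turns_since_t < tau:
            return pi_0(state)       # still inside the pi_0 rollout
        return pi_star_k(state)      # rollout over: maximise the k-th reward

    return policy


# Toy stand-ins (assumptions for illustration, not from the post).
def pi_0(state: State) -> Action:
    return "default"


def pi_star_k(state: State) -> Action:
    return "maximise_R_k"


composite = make_rollout_policy(pi_0, pi_star_k, tau=2)
actions = [composite(i, "s") for i in range(4)]
```

With `tau=0` the composite policy switches to the reward-maximising policy immediately, which corresponds to the one non-counterfactual case discussed in this comment.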

The penalty is defined in terms of these $V_{k} (s_{t + τ}^{(t)})$ , entirely counterfactual except for $τ = 0$ .

What $S A$ must do: iff $A$ follows $π_{0, τ, k}^{*}$ , then $S A$ must ensure that $V_{k} (s_{t + τ}^{(t)}) = V_{k} (s_{t + τ}^{' (t)})$ .

So a key question is: how early can $S A$ detect if $A$ is following $π_{0, τ, k}^{*}$ ? If it can detect this immediately, then it can immediately implement the policy for ensuring that $V_{k} (s_{t + τ}^{(t)}) = V_{k} (s_{t + τ}^{' (t)})$ (for the stricter impact measure) or the much easier $B V_{k} (s_{t}) = B V_{k} (s_{t}^{'})$ (for the original impact measure).

If it has to wait $τ$ turns to be sure, then in the meantime it needs to preserve its power to control $V_{k}$ for many different $k$.

Now, $A$ will never actually follow $π_{0, τ, k}^{*}$ , so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of $S A$ will determine the penalty; $S A$ is rewarded for quickly gaining control of the rewards.

In practice, $A$ will signal early on that it’s not following $π_{0, τ, k}^{*}$. Maybe there’s a very small impact action, $\emptyset^{ϵ}$, with a tiny penalty. Then $A$ would like to use that to signal to $SA$ that it’s not in the counterfactual world where it is following $π_{0, τ, k}^{*}$. So long as $π_{0}$’s first action isn’t $\emptyset^{ϵ}$, that signal is enough (of course, if $π_{0}$’s first action isn’t $\emptyset$, then $\emptyset$ is a sufficient signal).
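The detection step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `first_deviation` is a hypothetical helper, actions are compared as plain strings, and the subagent is assumed to know the sequence of actions $π_{0}$ would have taken. A deviation on turn 0 is then enough to rule out every rollout with $τ \geq 1$.

```python
from typing import Optional, Sequence


def first_deviation(
    observed: Sequence[str], pi_0_actions: Sequence[str]
) -> Optional[int]:
    """Return the first turn at which A's observed action differs from
    the action pi_0 would have taken, or None if no deviation is seen
    within the turns both sequences cover."""
    for turn, (seen, expected) in enumerate(zip(observed, pi_0_actions)):
        if seen != expected:
            return turn
    return None


# Toy example: pi_0 would play "a0" then "a1"; A instead opens with the
# small-impact signalling action (labelled "noop_eps" here).
signal_turn = first_deviation(["noop_eps", "a1"], ["a0", "a1"])
no_signal = first_deviation(["a0", "a1"], ["a0", "a1"])
```

If the signalling action coincided with $π_{0}$’s first action, `first_deviation` would return `None` for that turn, matching the caveat that the signal only works when the two differ.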

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn’t in.

The one mild constraint remaining comes from $s_{t + 0}^{(t)} = s_{t}$ , which is not a counterfactually defined state (though $V_{k} (s_{t})$ is a counterfactually defined property of that state).

I’ll try and write up a further clarification of this.