When you take the agent off-distribution, offer it several proxies for in-distribution reinforcement. When you offer these such that going out of your way for one proxy detours you from going after a different proxy, and if you can modulate which proxy the agent detours for (by bringing some proxy much closer to the agent, say), you learned that the agent must care some about all those proxies it pursues at cost. If the agent hasn’t come to value a proxy at all, then it will never take a detour to get to that proxy.
When you take the agent off-distribution, offer it several proxies for in-distribution reinforcement. When you offer these such that going out of your way for one proxy detours you from going after a different proxy, and if you can modulate which proxy the agent detours for (by bringing some proxy much closer to the agent, say), you learned that the agent must care some about all those proxies it pursues at cost. If the agent hasn’t come to value a proxy at all, then it will never take a detour to get to that proxy.