Select the worlds whose world history is ambiently controlled by the agent, that is the ambient dependence is non-constant, the conclusion of which world-history is implemented by given world-program depends on which strategy we assume the agent implements. Then read out the utility of reward channel from that strategy in that world.
Hmm… This is problematic if the same world contains multiple agent-instances that received different rewards (by following the same strategy but encountering different observations). What is the utility of such a world? This is a necessary question of specifying the decision problem. Perhaps it is a point where the notion of reinforcement learning breaks.
How do you formalize this? I couldn’t figure it out when I tried this.
Select the worlds whose world history is ambiently controlled by the agent, that is the ambient dependence is non-constant, the conclusion of which world-history is implemented by given world-program depends on which strategy we assume the agent implements. Then read out the utility of reward channel from that strategy in that world.
Hmm… This is problematic if the same world contains multiple agent-instances that received different rewards (by following the same strategy but encountering different observations). What is the utility of such a world? This is a necessary question of specifying the decision problem. Perhaps it is a point where the notion of reinforcement learning breaks.