Notice that we used values over final state, and explicitly set incremental reward at earlier timesteps to zero. That was load-bearing: with arbitrary freedom to choose rewards at earlier timesteps, any policy is optimal for some nontrivial values/rewards. (Proof: just pick the rewards at timestep t to reward whatever the policy does enough to overwhelm future value/rewards.)
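One way to spell that proof sketch out in symbols (my notation, not the post's): given any policy $\pi$, choose the incremental reward at timestep $t$ to be

$$ r_t(S, a) = M \cdot \mathbf{1}[a = \pi(S)], \qquad M > \max_{S'} V[S', t+1] - \min_{S'} V[S', t+1], $$

so that taking $\pi$'s action at timestep $t$ beats any alternative no matter which values come afterward, making $\pi$'s choice at $t$ optimal by construction.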
Do you expect that your methods would generalize to a utility function that was defined as the sum of some utility function over the state at some fixed intermediate timestep t and some utility function over the final state? Naively, I would think one could augment the state space such that the entire state at time t became encoded in subsequent states, and the utility function in question could then be expressed solely as a utility function over the final state. But I don’t know if that strategy is “allowed”.
If this method is “allowed”, I don’t understand why this theorem doesn’t extend to systems where incremental reward is nonzero at arbitrary timesteps.
If this method is not “allowed”, does that mean that this particular coherence theorem only holds for policies which care only about the final state of the world, so that agents which are coherent in this sense are not allowed to care about world histories, and the world state is not allowed to contain information about its own history?
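A minimal sketch of the augmentation described in this question, assuming a deterministic transition function `step(state, action)`; the names here (`augment`, `final_utility`, the snapshot tuple) are illustrative, not anything from the post:

```python
# Carry the state observed at timestep t forward inside every later state,
# so a utility that sums "utility of state at time t" and "utility of final
# state" becomes a utility over the final (augmented) state alone.

def augment(step, t):
    """Wrap `step(state, action) -> next_state` into a transition over
    (state, snapshot_of_state_at_t) pairs."""
    def augmented_step(aug_state, action, timestep):
        state, snapshot = aug_state
        if timestep == t:
            snapshot = state  # remember the time-t state from here on
        return (step(state, action), snapshot)
    return augmented_step

def final_utility(aug_final_state, u_intermediate, u_final):
    """The summed utility, now expressed purely over the final augmented state."""
    final_state, snapshot_at_t = aug_final_state
    return u_intermediate(snapshot_at_t) + u_final(final_state)
```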
You could extend it that way, and more generally you could extend it to sparse rewards. As the post says, coherence tells us about optimal policies away from the things which the goals care about directly. But in order for the theorem to say something substantive, there has to be lots of “empty space” where the incremental reward is zero. It’s in the empty space where coherence has substantive things to say.
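Concretely, the sparse-reward version just adds an incremental reward term to the post's dynamic-programming recursion (my notation):

$$ V[S,t] = r(S,t) + \max_{S' \text{ reachable from } S \text{ in the next timestep}} V[S', t+1], $$

where $r(S,t) = 0$ for most $(S,t)$; it's in that zero-reward empty space that the theorem constrains the optimal policy.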
So if I understand correctly, optimal policies specifically have to be coherent in their decision-making when all information about which decision was made is destroyed, and only information about the outcome remains. The load-bearing part being:
Now, suppose that at timestep t there are two different states, either of which can reach either state A or state B in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A.
Concrete example:
Start with the state diagram:
We assign values to the final states, and then do
start from those values over final state and compute the best value achievable starting from each state at each earlier time. That’s just dynamic programming: $V[S,t] = \max_{S' \text{ reachable in next timestep from } S} V[S',t+1]$, where $V[S,T]$ are the values over final states.
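A minimal sketch of that backward induction in code, assuming the state diagram is given as `successors[t][state] -> states reachable at t+1` and `final_values[state] -> value`; all names are illustrative:

```python
def backward_induction(successors, final_values, T):
    """Compute V[(S, t)] = max over S' reachable from S of V[(S', t+1)],
    with V[(S, T)] given by the values over final states."""
    V = {(s, T): v for s, v in final_values.items()}
    for t in range(T - 1, -1, -1):
        for s, next_states in successors[t].items():
            V[(s, t)] = max(V[(s2, t + 1)] for s2 in next_states)
    return V
```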
and so the reasoning is that there is no coherent policy which chooses Prize Room A from the front door but chooses Prize Room B from the side door.
But then if we update the states to include information about the history, and, say, put +3 on “histories where we have gone straight”, we get
and in that case, the optimal policy will go to Prize Room A from the front door and Prize Room B from the side door. This happens because “Prize Room A from the front door” is not the same node as “Prize Room A from the side door” in this graph.
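A small sketch of that history-augmented version of the example, reusing the `backward_induction` sketch above. Since the diagram isn't reproduced here, the room values, the door layout, and which moves count as going straight are all assumptions chosen to match the behavior described:

```python
# Assumed final values: Prize Room A = 5, Prize Room B = 4, plus +3 on
# histories where we have gone straight (all numbers illustrative).
final_values = {
    ("Room A", "went straight"): 5 + 3,
    ("Room A", "turned"):        5,
    ("Room B", "went straight"): 4 + 3,
    ("Room B", "turned"):        4,
}

# Assumed geometry: from the front door, going straight reaches Room A;
# from the side door, going straight reaches Room B.
successors = {
    0: {
        "Front Door": [("Room A", "went straight"), ("Room B", "turned")],
        "Side Door":  [("Room A", "turned"), ("Room B", "went straight")],
    }
}

V = backward_induction(successors, final_values, T=1)
# V[("Front Door", 0)] == 8: the optimal move is to Room A (5 + 3 for straight).
# V[("Side Door", 0)]  == 7: the optimal move is to Room B (4 + 3 for straight).
```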
The coherence theorem in the post talks about how optimal policies can’t take alternate options when presented with the same choice based on their history, but for the choice to be “the same choice” you have to have merging paths in the graph, and if nodes contain their own history, paths will never merge.
Is that basically why only the final state is allowed to “count” under this proof, or am I still missing something?
Edited to add: link to legible version of final diagram
That all looks correct.