So if I understand correctly, optimal policies specifically have to be coherent in their decision-making when all information about which decision was made is destroyed, and only information about the outcome remains. The load-bearing part being:
Now, suppose that at timestep t there are two different states either of which can reach either state A or state B in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A.
Concrete example:
Start with the state diagram
We assign values to the final states, and then do
start from those values over final state and compute the best value achievable starting from each state at each earlier time. That’s just dynamic programming: $V[S,t] = \max_{S' \text{ reachable in next timestep from } S} V[S',t+1]$, where $V[S,T]$ are the values over final states.
and so the reasoning is that there is no coherent policy which chooses Prize Room A from the front door but chooses Prize Room B from the side door.
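To make that concrete, here is a minimal sketch of one backward-induction step on a toy version of the graph. The node names match the example, but the successor lists and the numeric values are my own assumptions for illustration, not read off the diagram.

```python
# Hypothetical graph: both doors can reach both prize rooms.
# The values are made-up numbers, chosen only so that A is worth more than B.
successors = {
    "front door": ["Prize Room A", "Prize Room B"],
    "side door": ["Prize Room A", "Prize Room B"],
}
final_values = {"Prize Room A": 5, "Prize Room B": 2}


def backward_induction(successors, final_values):
    """One step of V[S, t] = max over reachable S' of V[S', t+1]."""
    values, policy = {}, {}
    for state, options in successors.items():
        # Pick the successor with the highest value at the next timestep.
        best = max(options, key=lambda s: final_values[s])
        policy[state] = best
        values[state] = final_values[best]
    return values, policy


values, policy = backward_induction(successors, final_values)
print(policy)
# {'front door': 'Prize Room A', 'side door': 'Prize Room A'}
```

Because both doors share the same successor nodes, and those nodes carry the same values whichever door you came from, the maximizer picks the same room from both doors whenever the two values differ.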
But then if we update the states to include information about the history and, say, put +3 on “histories where we have gone straight”, we get
and in that case, the optimal policy will go to Prize Room A from the front door and Prize Room B from the side door. This happens because “Prize Room A from the front door” is not the same node as “Prize Room A from the side door” in this graph.
The coherence theorem in the post says that an optimal policy can’t choose different options, depending on its history, when presented with the same choice. But for the choice to be “the same choice” you have to have merging paths on the graph, and if nodes contain their own history, paths will never merge.
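A rough sketch of the history-augmented version, again with hypothetical numbers: I am assuming that "going straight" means front door to Prize Room A and side door to Prize Room B, and the base values are made up.

```python
# Hypothetical base values per room and a +3 bonus for "straight" histories.
# Which (door, room) pairs count as "straight" is an assumption for illustration.
base_value = {"Prize Room A": 2, "Prize Room B": 1}
straight = {("front door", "Prize Room A"), ("side door", "Prize Room B")}
STRAIGHT_BONUS = 3


def history_value(history):
    """Value of a terminal node that remembers which door was used."""
    door, room = history
    return base_value[room] + (STRAIGHT_BONUS if history in straight else 0)


# Each terminal node is now a (door, room) history rather than just a room.
successors = {
    door: [(door, room) for room in base_value]
    for door in ("front door", "side door")
}

policy = {
    door: max(options, key=history_value)
    for door, options in successors.items()
}
print(policy)
# {'front door': ('front door', 'Prize Room A'), 'side door': ('side door', 'Prize Room B')}

# The two doors no longer share any successor node, so the paths never merge.
shared = set(successors["front door"]) & set(successors["side door"])
print(shared)
# set()
```

The empty intersection is the point: once nodes carry their history, the two doors are never offered "the same choice", so picking A from one and B from the other is not an inconsistent revealed preference in the theorem's sense.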
Is that basically why only the final state is allowed to “count” under this proof, or am I still missing something?
Edited to add: link to legible version of final diagram
That all looks correct.