That’s what would make the check nontrivial: IIUC there exist policies which are not consistent with any assignment of values satisfying that Bellman equation.
Ah, I see. Yeah, good point. So let’s imagine drawing a boundary around some zero-reward section of an MDP, and evaluating consistency within it. In essence, this is equivalent to saying that only actions which leave that section of the MDP have any reward. Without loss of generality we could do this by making some states terminal states, with only terminal states getting reward. (Or saying that only self-loops get reward, which is equivalent for deterministic policies.)
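For concreteness, here's a minimal sketch of that construction, assuming the MDP's deterministic transitions are given as a nested dict (state -> action -> next state); the function name and representation are just illustrative:

```python
def restrict_to_region(transitions, region):
    """Cut a deterministic MDP down to a zero-reward region.

    `transitions` maps state -> {action: next_state}; `region` is the set
    of states inside the boundary.  Any transition that leaves the region
    ends the episode at its destination, which we treat as a terminal
    "exit" state; only those exits would carry reward.
    """
    sub_transitions = {s: dict(transitions[s]) for s in region}
    exit_states = {nxt
                   for s in region
                   for nxt in transitions[s].values()
                   if nxt not in region}
    return sub_transitions, exit_states
```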
Now, there's some set of terminal states which are ever reached by a given deterministic policy. We can then evaluate the coherence of the policy as follows:
1. When going to a terminal state, does it always take the shortest path?
2. For every pair of terminal states in that set, is there some k such that it always goes to one unless the path to the other is at least k steps shorter?
3. Do these pairwise comparisons allow us to rank all the terminal states?
This could be calculated by working backwards from the terminal states that are sometimes reached, with each state keeping a record of which terminal states are reachable from it, and at what path lengths. A coherence metric here would presumably allow for some contradictions, but not many.
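Here's a rough sketch of that computation, under the same representation as the snippet above (deterministic transitions as a nested dict, a deterministic policy as a dict from state to action). The thresholds and preference inference are just one way of operationalising checks 1-3, not a canonical definition:

```python
from collections import deque
from itertools import combinations

def exit_distances(transitions, exits):
    """Shortest path length from every state to every exit, via backward
    BFS from each exit over the deterministic transition graph."""
    preds = {}
    for s, moves in transitions.items():
        for nxt in moves.values():
            preds.setdefault(nxt, set()).add(s)
    dist = {}
    for t in exits:
        dist.setdefault(t, {})[t] = 0
        frontier = deque([t])
        while frontier:
            cur = frontier.popleft()
            for p in preds.get(cur, ()):
                if t not in dist.setdefault(p, {}):
                    dist[p][t] = dist[cur][t] + 1
                    frontier.append(p)
    return dist

def rollout(policy, transitions, start, exits, max_steps=10_000):
    """Follow the deterministic policy until it hits an exit."""
    s, steps = start, 0
    while s not in exits and steps < max_steps:
        s = transitions[s][policy[s]]
        steps += 1
    return s, steps  # (state reached, path length)

def coherence_report(policy, transitions, starts, exits):
    dist = exit_distances(transitions, exits)
    outcomes = {s: rollout(policy, transitions, s, exits) for s in starts}

    # Check 1: whenever the policy heads to an exit, it takes a shortest path.
    shortest = all(reached in dist.get(s, {}) and steps == dist[s][reached]
                   for s, (reached, steps) in outcomes.items())

    # Check 2: for each pair of exits (a, b), is there a single threshold k
    # such that the policy goes to a unless the path to b is at least k
    # steps shorter?  Such a k exists iff the step-advantage of b whenever
    # b is chosen strictly exceeds its advantage whenever a is chosen.
    pairwise_ok = True
    prefers = set()  # (a, b) means a appears to be ranked above b
    for a, b in combinations(exits, 2):
        adv_a = [dist[s][a] - dist[s][b]  # steps b would have saved
                 for s, (r, _) in outcomes.items()
                 if r == a and b in dist.get(s, {})]
        adv_b = [dist[s][a] - dist[s][b]
                 for s, (r, _) in outcomes.items()
                 if r == b and a in dist.get(s, {})]
        if adv_a and adv_b:
            pairwise_ok = pairwise_ok and max(adv_a) < min(adv_b)
        if any(m > 0 for m in adv_a):  # chose a although b was strictly closer
            prefers.add((a, b))
        if any(m < 0 for m in adv_b):  # chose b although a was strictly closer
            prefers.add((b, a))

    # Check 3: do the pairwise preferences rank all the exits, i.e. is the
    # preference graph acyclic?  (Peel off exits nothing is ranked above.)
    remaining, ranking_ok = set(exits), True
    while remaining:
        top = [x for x in remaining
               if not any((y, x) in prefers for y in remaining)]
        if not top:
            ranking_ok = False  # a preference cycle: no consistent ranking
            break
        remaining -= set(top)

    return {"shortest_paths": shortest,
            "pairwise_thresholds": pairwise_ok,
            "consistent_ranking": ranking_ok}
```

A coherence metric could then count how many of the rollouts or pairs violate each check, rather than returning booleans, which would let a few contradictions slide while still flagging grossly incoherent policies.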
Note that going to many different terminal states from different starting points doesn’t necessarily imply a lack of coherence—it might just be the case that there are many nearly-equally-good ways to exit this section of the MDP. It all depends on how the policy goes to those states.