Note that in the setting we describe here, we start off only with a policy and a (reward-less) MDP. No rewards, no value functions. Given this, there is always a value function or q-function consistent with the policy and the Bellman equations.
That’s only true if the Bellman equation in question allows for a “current payoff” at every timestep. That payoff term is what allows for totally arbitrary value functions, and, not coincidentally, it’s also the term which does not reflect long-range goals/planning, just immediate payoff.
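To spell that out in symbols (my own notation, not anything fixed upthread: deterministic policy $\pi$, deterministic transition function $T$, discount $\gamma$): with a per-step payoff, the Bellman equation along the policy is

$$V(s) = r\big(s, \pi(s)\big) + \gamma\, V\!\big(T(s, \pi(s))\big),$$

and for any candidate $V$ whatsoever we can just define $r(s, \pi(s)) := V(s) - \gamma\, V\big(T(s, \pi(s))\big)$ to satisfy it (and, since the payoff is free, also make the chosen action win any greedy/argmax condition). So the payoff term is exactly what makes the consistency check vacuous.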
If we’re interested in long-range goals/planning, then the natural thing to do is check how consistent the policy is with a Bellman equation without a payoff at each timestep, i.e. a value function backpropagated purely from some goal at a much later time. That’s what would make the check nontrivial: there exist policies which are not consistent with any assignment of values satisfying that Bellman equation. For example, consider a policy which, at state A, chooses the transition A → B with probability 1 over the option of staying at A (implying value B > value A for any values consistent with that choice), but at state B chooses the transition B → A with probability 1 over the option of staying at B (implying value A > value B). No assignment of values can satisfy both.
(There’s still the trivial case where indifference could be interpreted as compatible with any policy, but that’s easy to handle by adding a nontriviality requirement.)
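Here’s a minimal sketch of that check under the nontriviality reading, where choosing one successor over a distinct alternative is taken to imply a strict value inequality (a deterministic toy setup; the names `consistent`, `step`, and so on are mine, purely for illustration). A policy is consistent with some value assignment exactly when the implied strict inequalities contain no cycle:

```python
from collections import defaultdict

def implied_preferences(states, actions, step, policy):
    """Yield (worse, better) pairs: consistency requires V(better) > V(worse)."""
    for s in states:
        chosen = step(s, policy[s])
        for a in actions(s):
            alt = step(s, a)
            if alt != chosen:
                yield (alt, chosen)

def consistent(states, actions, step, policy):
    """True iff some value assignment satisfies every implied strict
    inequality, i.e. the strict-preference graph is acyclic."""
    graph = defaultdict(set)
    for worse, better in implied_preferences(states, actions, step, policy):
        graph[worse].add(better)

    WHITE, GREY, BLACK = 0, 1, 2          # standard DFS cycle detection
    color = defaultdict(int)

    def has_cycle(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY or (color[nxt] == WHITE and has_cycle(nxt)):
                return True
        color[node] = BLACK
        return False

    return not any(color[s] == WHITE and has_cycle(s) for s in states)

# The A <-> B example from above: two states, each with a "stay" and a "move" action.
states = ["A", "B"]
actions = lambda s: ["stay", "move"]
step = lambda s, a: s if a == "stay" else ("B" if s == "A" else "A")
policy = {"A": "move", "B": "move"}
print(consistent(states, actions, step, policy))   # False: no value assignment works
```

Acyclicity is the right condition here: with no cycle, a topological order of the preference graph yields values satisfying every inequality, while the A ↔ B policy produces a two-edge cycle and so fails.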
That’s what would make the check nontrivial: IIUC there exist policies which are not consistent with any assignment of values satisfying that Bellman equation.
Ah, I see. Yeah, good point. So let’s imagine drawing a boundary around some zero-reward section of an MDP, and evaluating consistency within it. In essence, this is equivalent to saying that only actions which leave that section of the MDP have any reward. Without loss of generality we could do this by making some states terminal states, with only terminal states getting reward. (Or saying that only self-loops get reward, which is equivalent for deterministic policies.)
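In code, that construction might look something like this (a minimal sketch under deterministic-everything assumptions; `carve_region` and the other names are hypothetical, mine rather than anything from the thread):

```python
def carve_region(interior, actions, step):
    """Split a deterministic, reward-less MDP into an interior region and
    terminal "exit" states: dynamics are unchanged inside the region, and any
    state reachable in one step from the interior but not inside it becomes an
    absorbing exit. Only the exits would ever carry reward, matching the
    construction described above."""
    interior = set(interior)
    exits = {step(s, a) for s in interior for a in actions(s)} - interior

    def sub_step(s, a):
        return s if s in exits else step(s, a)    # exits are absorbing

    return interior, exits, sub_step

# Tiny usage example: a 3-state chain 0 - 1 - 2, carving out the region {1}.
actions = lambda s: ["left", "right"]
step = lambda s, a: max(0, s - 1) if a == "left" else min(2, s + 1)
interior, exits, sub_step = carve_region({1}, actions, step)
print(exits)   # {0, 2}: both neighbours of the region become terminal exits
```

Once the region is carved out this way, any reward lives entirely on the exit states, so the coherence question reduces to how the policy chooses among exits; the sketch further down reuses objects of the same shape.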
Now, there’s some set of terminal states which a given deterministic policy ever actually reaches. And so we can evaluate the coherence of the policy as follows:
1. When going to a terminal state, does it always take the shortest path?
2. For every pair of terminal states in that set, is there some k such that it always goes to one unless the path to the other is at least k steps shorter?
3. Do these pairwise comparisons allow us to rank all terminal states?
This could be calculated by working backwards from the terminal states that are sometimes reached, with each state keeping a record of which terminal states are reachable from it, and via what path lengths. A metric of coherence here will presumably have to allow for some contradictions, but not many.
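Here’s a rough sketch of that backward computation plus check 1, along with the raw per-pair data that checks 2 and 3 would be built on (again, all names are mine, everything is assumed deterministic, and the policy is assumed to always leave the region):

```python
from collections import defaultdict

def shortest_exit_distances(interior, exits, actions, step):
    """dist[s][t] = length of the shortest path from interior state s to exit t,
    over ALL actions (not just the policy's), computed by a backwards BFS from
    each exit over the reversed transition graph."""
    preds = defaultdict(set)                       # reversed edges, interior only
    for s in interior:
        for a in actions(s):
            preds[step(s, a)].add(s)

    dist = defaultdict(dict)
    for t in exits:
        frontier, d = {t}, 0
        while frontier:
            d += 1
            frontier = {p for x in frontier for p in preds[x] if t not in dist[p]}
            for p in frontier:
                dist[p][t] = d
    return dist

def policy_rollout(s, interior, step, policy):
    """Which exit the deterministic policy reaches from s, and in how many steps
    (assumes the policy always leaves the region eventually)."""
    steps = 0
    while s in interior:
        s, steps = step(s, policy[s]), steps + 1
    return s, steps

def coherence_report(interior, exits, actions, step, policy):
    dist = shortest_exit_distances(interior, exits, actions, step)
    takes_shortest_path = True
    margins = defaultdict(list)        # raw data for the pairwise-threshold check
    for s in interior:
        target, steps = policy_rollout(s, interior, step, policy)
        if steps != dist[s][target]:
            takes_shortest_path = False            # check 1 fails at state s
        for other, d_other in dist[s].items():
            if other != target:
                # the chosen exit won despite its shortest path being this many
                # steps longer than the path to `other` (negative if shorter):
                margins[(target, other)].append(dist[s][target] - d_other)
    return takes_shortest_path, dict(margins)
```

Only the shortest-path check is implemented directly here; the pairwise thresholds and the overall ranking would be read off the `margins` data, and how much slack to tolerate there is exactly where a graded coherence metric, rather than a binary verdict, would come in.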
Note that going to many different terminal states from different starting points doesn’t necessarily imply a lack of coherence—it might just be the case that there are many nearly-equally-good ways to exit this section of the MDP. It all depends on how the policy goes to those states.