I don’t usually think about RL on MDPs, but it’s an unusually easy setting in which to talk about coherence and its relationship to long-term-planning/goal-seeking/power-seeking.
Simplest starting point: suppose we’re doing RL to learn a value function (i.e. mapping from states to values, or mapping from states x actions to values, whatever your preferred setup), with transition probabilities known. Well, in terms of optimal behavior, we know that the optimal value function for any objective in the far future will locally obey the Bellman equation with zero payoff in the immediate timestep: value of this state is equal to the max over actions of expected next-state value under that action. So insofar as we’re interested in long-term goals specifically, there’s an easy local check for the extent to which the value function “optimizes for” such long-term goals: just check how well it locally satisfies that Bellman equation.
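For concreteness, here's a minimal sketch of that local check in Python. The toy MDP, the dict encoding, and the function name are all made up for illustration: for each non-terminal state s, we compare V(s) against the max over actions of the expected next-state value, and report the residual.

```python
# Minimal sketch of the local Bellman check (zero immediate payoff).
# Hypothetical toy MDP: transitions[s][a] maps successor state -> probability.
transitions = {
    0: {"safe": {1: 1.0}, "risky": {2: 0.5, 3: 0.5}},
    1: {"go": {2: 1.0}},
    2: {},  # terminal: the "goal"
    3: {},  # terminal: the "miss"
}

def bellman_residuals(V, transitions):
    """For each non-terminal state s, return |V(s) - max_a E[V(s') | s, a]|.

    A value function that purely "optimizes for" some far-future objective
    should have residual ~0 everywhere; large residuals flag states where
    the values don't reflect any long-range plan.
    """
    residuals = {}
    for s, actions in transitions.items():
        if not actions:  # terminal state: its value is a boundary condition
            continue
        best = max(sum(p * V[s2] for s2, p in dist.items())
                   for dist in actions.values())
        residuals[s] = abs(V[s] - best)
    return residuals

# Values backpropagated from the goal (state 2) pass the check...
print(bellman_residuals({0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0}, transitions))
# ...while values that don't encode a long-range plan do not.
print(bellman_residuals({0: 0.5, 1: 1.0, 2: 1.0, 3: 0.0}, transitions))
```

Nothing here depends on how V was learned; the check is purely local, which is the point.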
From there, we can extend to gradually more complicated cases in ways which look similar to typical coherence theorems (like e.g. Dutch Book theorems). For instance, we could relax the requirement of known probabilities: we can ask whether there is any assignment of state-transition probabilities such that the values satisfy the Bellman equation.
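A hedged sketch of that relaxed check, under an assumption not in the original: we know which successor states are possible under each action, just not their probabilities. Then at each state we need every action to admit an expected next-state value at or below V(s), and at least one action to be able to hit V(s) exactly. Since the achievable expectations under an action sweep out the interval between the smallest and largest successor values, the whole thing reduces to interval comparisons.

```python
def probabilities_exist(V, support, tol=1e-9):
    """Does SOME assignment of transition probabilities (on the given
    supports) make V satisfy the payoff-free Bellman equation?

    support[s][a] = set of successor states possible under action a at s.
    At each non-terminal s we need:
      - min successor value <= V(s) for every action   (each action's
        expectation can be held at or below V(s)), and
      - max successor value >= V(s) for some action    (the max over
        actions can hit V(s) exactly).
    """
    for s, actions in support.items():
        if not actions:  # terminal state: nothing to check
            continue
        lows = [min(V[s2] for s2 in succs) for succs in actions.values()]
        highs = [max(V[s2] for s2 in succs) for succs in actions.values()]
        if any(low > V[s] + tol for low in lows):
            return False
        if all(high < V[s] - tol for high in highs):
            return False
    return True

# Hypothetical example: from state 0, action "go" can land in state 1 or 2.
support = {0: {"go": {1, 2}}, 1: {}, 2: {}}
print(probabilities_exist({0: 0.7, 1: 0.0, 2: 1.0}, support))  # True: P(2|0,go)=0.7 works
print(probabilities_exist({0: 2.0, 1: 0.0, 2: 1.0}, support))  # False: E[V] can't exceed 1
```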
As another example, if we’re doing RL on a policy rather than value function, we can ask whether there exists any value function consistent with the policy such that the values satisfy the Bellman equation.
Note that in the setting we describe here, we start off only with a policy and a (reward-less) MDP. No rewards, no value functions. Given this, there is always a value function or q-function consistent with the policy and the Bellman equations.
That’s only true if the Bellman equation in question allows for a “current payoff” at every timestep. That’s the term which allows for totally arbitrary value functions, and not-coincidentally it’s the term which does not reflect long-range goals/planning, just immediate payoff.
If we’re interested in long-range goals/planning, then the natural thing to do is check how consistent the policy is with a Bellman equation without a payoff at each timestep, i.e. a value function just backpropagated from some goal at a much later time. That’s what would make the check nontrivial: there exist policies which are not consistent with any assignment of values satisfying that Bellman equation. For example, consider the policy which chooses to transition from state A → B with probability 1 over the option to stay at A with probability 1 (implying value B > value A for any values consistent with that policy), but which also chooses to transition B → A with probability 1 over the option to stay at B with probability 1 (implying value A > value B for any values consistent with that policy); no assignment of values can satisfy both constraints.
(There’s still the trivial case where indifference could be interpreted as compatible with any policy, but that’s easy to handle by adding a nontriviality requirement.)
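Here's a minimal sketch of that consistency check in the simplest setting, which is an assumption on my part rather than something spelled out above: a deterministic policy on an MDP with known deterministic transitions, with the nontriviality requirement that the chosen successor is strictly preferred to every distinct alternative successor. Each choice then asserts "value of chosen successor > value of alternative successor", exactly as in the A/B example, and some value assignment satisfies all of those constraints iff the resulting "strictly preferred to" graph has no directed cycle.

```python
def is_coherent(policy, next_state):
    """policy[s] = chosen action; next_state[s][a] = successor of s under a.

    Each choice asserts V(chosen successor) > V(alternative successor),
    mirroring the A/B example above. A value assignment satisfying all of
    these exists iff the constraint graph (edge u -> v means V(u) < V(v))
    has no directed cycle.
    """
    graph = {}
    for s, chosen_action in policy.items():
        chosen = next_state[s][chosen_action]
        for action, alt in next_state[s].items():
            # Actions leading to the same successor imply no strict preference.
            if action != chosen_action and alt != chosen:
                graph.setdefault(alt, set()).add(chosen)

    # Standard three-colour DFS cycle detection.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GRAY
        for v in graph.get(u, ()):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color.get(u, WHITE) == WHITE and dfs(u) for u in list(graph))

# The example from above: each state strictly prefers moving to staying.
next_state = {"A": {"stay": "A", "move": "B"},
              "B": {"stay": "B", "move": "A"}}
print(is_coherent({"A": "move", "B": "move"}, next_state))  # False: V(A)<V(B)<V(A)
print(is_coherent({"A": "move", "B": "stay"}, next_state))  # True: V(A)<V(B) works
```

On the A/B example the two constraints form a two-cycle, so no value assignment exists; flipping B's choice to "stay" breaks the cycle.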
Ah, I see. Yeah, good point. So let’s imagine drawing a boundary around some zero-reward section of an MDP, and evaluating consistency within it. In essence, this is equivalent to saying that only actions which leave that section of the MDP have any reward. Without loss of generality we could do this by making some states terminal states, with only terminal states getting reward. (Or saying that only self-loops get reward, which is equivalent for deterministic policies.)
Now, there’s some set of terminal states which are ever taken by a deterministic policy. And so we can evaluate the coherence of the policy as follows:
1. When going to a terminal state, does it always take the shortest path?
2. For every pair of terminal states in that set, is there some k such that it always goes to one unless the path to the other is at least k steps shorter?
3. Do these pairings allow us to rank all terminal states?
This could be calculated by working backwards from the terminal states that are sometimes taken, with each state keeping a record of which terminal states are reachable from it via different path lengths. And then a metric of coherence here will allow for some contradictions, presumably, but not many.
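Here's a rough sketch of the first check under the same simplifying assumptions as before (deterministic transitions and policy; the function and variable names are just illustrative). The backward pass computes each state's shortest distance to each terminal state; the forward pass rolls the policy out and compares its step count against that distance. The second and third checks could then be layered on top by looking at which terminal the policy picks as a function of the gaps between those shortest distances.

```python
from collections import deque

def shortest_path_coherence(policy, next_state, terminals):
    """Check 1 above: whenever the policy reaches a terminal state, does it
    take a shortest path to it? Assumes deterministic transitions/policy.

    next_state[s][a] = successor of s under action a.
    Returns a dict mapping each non-terminal state to
    (terminal_reached, policy_steps, shortest_steps), or None if the policy
    never reaches a terminal state from there.
    """
    # Backward BFS from each terminal state over reversed edges.
    reverse = {}
    for s, actions in next_state.items():
        for succ in actions.values():
            reverse.setdefault(succ, set()).add(s)

    dist = {t: {t: 0} for t in terminals}
    for t in terminals:
        queue = deque([t])
        while queue:
            u = queue.popleft()
            for v in reverse.get(u, ()):
                if v not in dist[t]:
                    dist[t][v] = dist[t][u] + 1
                    queue.append(v)

    # Roll the policy forward from each non-terminal state.
    report = {}
    for start in next_state:
        if start in terminals:
            continue
        s, steps, seen = start, 0, set()
        while s not in terminals and s not in seen:
            seen.add(s)
            s = next_state[s][policy[s]]
            steps += 1
        if s in terminals:
            report[start] = (s, steps, dist[s][start])
        else:
            report[start] = None  # policy loops forever from here
    return report

# Tiny illustration: from state 0 the policy detours through 1 before exiting at "T".
next_state = {0: {"exit": "T", "detour": 1}, 1: {"exit": "T"}, "T": {}}
policy = {0: "detour", 1: "exit"}
print(shortest_path_coherence(policy, next_state, {"T"}))
# {0: ('T', 2, 1), 1: ('T', 1, 1)}  -> state 0 takes 2 steps where 1 would do
```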
Note that going to many different terminal states from different starting points doesn’t necessarily imply a lack of coherence—it might just be the case that there are many nearly-equally-good ways to exit this section of the MDP. It all depends on how the policy goes to those states.