I have thought some about how to measure the “coherence” of a policy in an MDP. One nice approach I came up with was summing the absolute values of the real parts of the eigenvalues of the MDP’s transition matrix, with and without the policy present. The lower this sum, the more coherent the policy. It seemed to work well for my purposes, but I haven’t subjected it to much strain yet.
The eigenvalues are a measure of how quickly the MDP reaches a steady state. This works when you know the goal of your network is to reach a particular state in the MDP as fast as possible and stay there.
Edit: I think this also works if your model’s goal is to achieve a particular distribution over states.
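For reference, here is a minimal sketch of what this computation might look like for a tabular MDP. The shapes of `P` and `pi`, and the use of a uniform-random policy as the “without the policy” baseline, are assumptions on my part rather than a fixed part of the definition.

```python
import numpy as np

def eigenvalue_sum(P, pi=None):
    """Sum of |Re(eigenvalue)| of the state-transition matrix induced on the MDP.

    P  : transition tensor, shape (A, S, S), with P[a, s, s'] = Pr(s' | s, a)
    pi : policy, shape (S, A); if None, a uniform-random policy is used
         (one possible reading of the "without the policy" case).
    """
    A, S, _ = P.shape
    if pi is None:
        pi = np.full((S, A), 1.0 / A)
    # Induced chain: P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']
    P_pi = np.einsum('sa,ast->st', pi, P)
    eigvals = np.linalg.eigvals(P_pi)
    return np.abs(eigvals.real).sum()

# On this reading, a lower value under the policy (relative to the
# no-policy baseline) would indicate a more coherent policy.
```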
Right, I think this somewhat corresponds to the “how long it takes a policy to reach a stable loop” measure (the “distance to loop” metric) that we used in our experiments.
What did you use your coherence definition for?
It’s a long story, but I wanted to see what the functional landscape of coherence looked like for goal-misgeneralizing RL environments after doing essential dynamics. Results forthcoming.
They are related, but time-to-loop fails when there are many loops a random policy is likely to access. For example, if a “do nothing” action is the default, your agent will immediately enter a loop, but the sum of the absolute values of the real parts of the eigenvalues will be very high (equal to the number of states in the environment).
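To make that failure mode concrete: a “do nothing” default induces (roughly) the identity transition matrix, so time-to-loop is zero everywhere while every eigenvalue is 1 and the sum equals the number of states. A toy check, with the 5-state chain purely illustrative:

```python
import numpy as np

# "Do nothing" policy in a hypothetical 5-state environment:
# the induced chain is the identity matrix, so every state is already a loop.
S = 5
P_pi = np.eye(S)

eigvals = np.linalg.eigvals(P_pi)
print(np.abs(eigvals.real).sum())  # 5.0 -- equals the number of states
```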