An important part of the paper that I think is easily missed, and useful for people doing work on distances between reward vectors:
There is some existing literature on defining distances between reward functions (e.g., see Gleave et al.). However, all proposed distances are only pseudometrics.
A bit about distance functions:
Commonly, two reward functions are defined to be the same (e.g., see Skalse et al.) if they’re equivalent up to (positively) scaling the reward function and introducing potential shaping. By the latter, I mean that two reward functions are the same if one is R and the other is of the form R + γ·Φ(next state) − Φ(current state) for some potential function Φ and discount γ. This is because Ng et al. showed that these make up all the reward functions that we know give the same optimal policy as the original reward across all environments (with the same state/action space).
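To make the shaping equivalence concrete, here’s a quick numerical check (my own sketch, not from the paper) of the telescoping argument behind Ng et al.’s result: along any state sequence, shaping shifts the discounted return by an amount that depends only on the endpoints, so it can’t change which policies are optimal.

```python
import numpy as np

# Along any trajectory, potential shaping changes the discounted return by
# gamma^T * Phi(s_T) - Phi(s_0), independent of the actions taken.
gamma = 0.9
rng = np.random.default_rng(1)

n_states, T = 5, 50
Phi = rng.normal(size=n_states)               # arbitrary potential
traj = rng.integers(0, n_states, size=T + 1)  # arbitrary state sequence
base_r = rng.normal(size=T)                   # arbitrary per-step rewards

# Shaped reward at step t: r_t + gamma * Phi(s_{t+1}) - Phi(s_t)
shaped_r = base_r + gamma * Phi[traj[1:]] - Phi[traj[:-1]]

discounts = gamma ** np.arange(T)
G_base = discounts @ base_r
G_shaped = discounts @ shaped_r

# The shaping terms telescope to gamma^T * Phi(s_T) - Phi(s_0):
print(np.isclose(G_shaped - G_base, gamma**T * Phi[traj[-1]] - Phi[traj[0]]))
```

The difference is the same for every action sequence starting at s_0, which is why shaping preserves the optimal policy.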
This leads us to the following important claim:
Projecting reward vectors onto Ω and taking the angle between the projections gives a perfect distance metric according to these desiderata.
Why: it can easily be shown that this is a metric, provided it is well-defined under the equivalence relation. It can also be shown that the locus of reward functions that give the same projection onto Ω as R is exactly the set of potential-shaped reward functions. The claim then follows fairly directly.
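A small numerical sketch of this (my construction, with assumptions: rewards on (s, s') pairs flattened into vectors, and Ω taken to be the orthogonal complement of the potential-shaping subspace): shaped and rescaled versions of R project onto the same ray, so the angle between projections is zero exactly on the equivalence class.

```python
import numpy as np

# Assumed setup: n states, reward on (s, s') pairs, flattened to length n*n.
n, gamma = 4, 0.9
rng = np.random.default_rng(0)

# Basis of the potential-shaping subspace: for the indicator potential
# Phi_k = 1[state == k], the shaping direction has entry
# gamma*1[s' == k] - 1[s == k] at flattened position s*n + sp.
shaping = np.zeros((n, n * n))
for k in range(n):
    for s in range(n):
        for sp in range(n):
            shaping[k, s * n + sp] = gamma * (sp == k) - (s == k)

# Orthonormal basis for the shaping subspace via QR; projecting onto its
# orthogonal complement is the "Omega" assumed in this sketch.
Q, _ = np.linalg.qr(shaping.T)  # columns span the shaping subspace

def project(r):
    return r - Q @ (Q.T @ r)

def angle(r1, r2):
    p1, p2 = project(r1), project(r2)
    cos = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return np.arccos(np.clip(cos, -1.0, 1.0))

R = rng.normal(size=n * n)
Phi = rng.normal(size=n)
# A rescaled, potential-shaped version of R (same equivalence class).
R_shaped = 3.0 * R + np.array(
    [gamma * Phi[sp] - Phi[s] for s in range(n) for sp in range(n)]
)
R_other = rng.normal(size=n * n)

print(angle(R, R_shaped))  # ~0: same equivalence class
print(angle(R, R_other))   # generically well above 0
```

Scaling doesn’t move the projection off its ray and shaping doesn’t move it at all, which is exactly why the angle is well-defined on the quotient.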
In particular, this seems like the most natural “true” reward metric, and I’m not sure any other “true” metrics have even been proposed before this.
I remarked on this claim while the paper was in review, when I was asked to give some feedback on the paper [it’s still under official review, I think]. Some of the earlier proposed metrics are in fact full metrics in the relevant sense, provided they have full coverage on the reward space. E.g., STARC distances are metrics on the quotient space of reward functions by the equivalences, aren’t they, if the distance metric has full coverage? (Which is the same as what this project-then-take-angle metric is.) In particular, I think the angle relates 1-1 with the L2 distance on a suitably canonicalised unit ball from STARC/EPIC.

You’re right. For some reason, I thought EPIC was a pseudometric on the quotient space and not on the full reward space.
I think this makes the thing I’m saying much less useful.
I still think this is an important point, and I’ve been thinking there should be a bloggy write-up of the maths in this area on LW/AF! Maybe you (or I, or Jacek, or Charlie, or Joar, or whoever...) could make that happen.
The original EPIC definition, and the STARC defs, can be satisfied while yielding only a pseudometric on the quotient space. But they also include many full (quotient) metrics, and the (kinda default?) L2 choice (assuming full-support weighting) yields a full metric.
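On that last point, a quick sanity check (my own, not from any of these papers) of why the L2 choice tracks the angle: for unit vectors, generic vectors here standing in for canonicalised, normalised rewards, the L2 distance is ‖u − v‖ = 2·sin(θ/2), a strictly increasing function of the angle θ on [0, π], so the two give the same notion of closeness.

```python
import numpy as np

# For unit vectors u, v at angle theta: ||u - v||^2 = 2 - 2*cos(theta)
#                                                  = 4*sin^2(theta / 2).
rng = np.random.default_rng(2)
u, v = rng.normal(size=8), rng.normal(size=8)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

theta = np.arccos(np.clip(u @ v, -1.0, 1.0))
print(np.isclose(np.linalg.norm(u - v), 2 * np.sin(theta / 2)))  # True
```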