As I’ve previously said in person, this is one of the best technical articulations of Goodhart I’ve encountered. Really glad you made it into a paper!
It’s worth mentioning that I think the ‘angle between reward functions’ in the occupancy space should relate somewhat neatly to the ‘distances between reward functions’ thread of research (‘reward function theory’?) including stuff Joar has worked on, and my upcoming paper on delegation games.
That is right! See this paper.
Thanks, I've amended the link in my comment to point to this updated version.