Agreed. When I wrote U(π) I meant it as shorthand for U(x)|π, though looking at it again I can see that notation conflates reward and utility in a confusing way.
That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies.
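To make the "accidental" part concrete, here's a minimal toy sketch (names like `POLICY_SLOT` and `plan` are hypothetical illustrations, not from anything above): a greedy planner whose own policy weight sits inside the very world state it optimizes over, so it eventually overwrites that weight simply because doing so raises reward.

```python
# Toy sketch of "accidental" self-modification by an embedded agent.
# All names here are made up for illustration.

WORLD_SIZE = 4
POLICY_SLOT = 2          # the agent's policy weight lives *inside* the world state

def step(state, action):
    """Environment dynamics: an action writes a value into one world cell."""
    cell, value = action
    new_state = list(state)
    new_state[cell] = value   # nothing stops cell == POLICY_SLOT
    return new_state

def reward(state):
    # Reward happens to be highest when every cell is maxed out,
    # including the cell that stores the policy weight.
    return sum(state)

def plan(state):
    """Greedy one-step planner that is blind to its own embedding:
    it scores actions only by reward, never by whether they clobber
    the policy slot."""
    candidates = [(cell, 9) for cell in range(WORLD_SIZE)]
    return max(candidates, key=lambda a: reward(step(state, a)))

state = [0, 0, 5, 0]      # policy weight currently 5, stored at POLICY_SLOT
for _ in range(4):
    action = plan(state)
    state = step(state, action)
    print(action, state)
# On the final step the planner writes into POLICY_SLOT because that
# maximizes reward -- it has "self-modified" without representing that fact.
```

The point is just that nothing in `plan` distinguishes the policy cell from any other cell, so the self-modification is a side effect of optimization rather than a represented choice.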
That makes sense now, although I am still curious whether there is a case where it purposely self-modifies rather than doing so accidentally.