Richard_Ngo comments on Richard Ngo’s Shortform

Richard_Ngo 12 May 2022 18:52 UTC
LW: 21 AF: 11
AF
Probably the easiest “honeypot” is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that’s anything like “get more reward” (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
- Pattern 19 May 2022 19:14 UTC
  2 points
  Parent
  You don’t want it to be relatively easy to an outside force. Otherwise they can lead it to do as they please, and writing weird behaviour off as ‘oh, it’s changed our rewards, reset it again’, poses some risk.