Probably the easiest “honeypot” is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that’s anything like “get more reward” (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
You don’t want it to be relatively easy to an outside force. Otherwise they can lead it to do as they please, and writing weird behaviour off as ‘oh, it’s changed our rewards, reset it again’, poses some risk.
Probably the easiest “honeypot” is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that’s anything like “get more reward” (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
You don’t want it to be relatively easy to an outside force. Otherwise they can lead it to do as they please, and writing weird behaviour off as ‘oh, it’s changed our rewards, reset it again’, poses some risk.