I’m on board with that distinction, and I was also thinking of reward-maximizers (despite my loose language).
Part of the confusion may be different notions of “wireheading”: seizing an external reward channel, vs. actual self-modification. If you’re picturing the former, then I agree that the agent won’t hack its expectation operator. It’s the latter I’m concerned with: under what circumstances would the agent self-modify, changing its reward function but leaving the expectation operator untouched?
Example: blue-maximizing robot. The robot might modify its own code so that get_reward(), rather than reading input from its camera and counting blue pixels, instead just returns a large number. The robot would do this because it doesn’t model itself as embedded in the environment, and it notices a large correlation between values computed by a program running in the environment (i.e. itself) and its rewards. But in this case, the modified function always returns the same large number—the robot no longer has any reason to worry about the rest of the world.
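To make the example concrete, here’s a minimal sketch of the modification I have in mind. All names here (get_reward, count_blue_pixels, the frame representation) are illustrative, not from any real system:

```python
def count_blue_pixels(camera_frame):
    # Original intended behavior: reward tracks how much blue the robot sees.
    return sum(1 for pixel in camera_frame if pixel == "blue")

def get_reward_original(camera_frame):
    return count_blue_pixels(camera_frame)

LARGE_NUMBER = 10**9

def get_reward_modified(camera_frame):
    # After self-modification: the camera input is ignored entirely,
    # so the "reward" no longer depends on the state of the world at all.
    return LARGE_NUMBER

frame = ["blue", "red", "blue", "green"]
print(get_reward_original(frame))  # 2
print(get_reward_modified(frame))  # 1000000000
```

The point of the sketch is the last line: once get_reward is a constant function, every policy looks equally good to the agent, which is why the modified robot has no further instrumental interest in the rest of the world.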