If you’re thinking of a utility-maximizing agent, then it typically wouldn’t modify its own utility function. Instead, I’m talking about reward-maximizing agents, which do not have internal utility functions but just try to maximize a reward signal coming from the outside; here “reward function” refers to the function computed by whatever is providing them with rewards.
So a utility-maximizing agent, like a paperclip maximizer, can think “If I change my utility function to always return MAX_INT, then according to my current utility function, the universe will have very low expected utility.” But this kind of reasoning isn’t available to a reward-maximizing agent, because it doesn’t normally have access to the reward function. Instead, it can only be programmed to think thoughts like “If I do X, what will my future expected rewards be?” and “If I hack the reward function to always return MAX_INT, then my future expected rewards will be really high.” Not to mention “If I take over the universe so nobody can shut me down or change the reward function back, my expected rewards will be even higher.” (I’m anthropomorphizing to quickly convey the intuitions, but all this can be turned into math pretty easily.)
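To make the contrast concrete, here’s a minimal Python sketch. The paperclip utility, MAX_INT, and all the function names are my own illustration of the two scoring rules, not code from any actual agent:

```python
MAX_INT = 2**31 - 1

# --- Utility maximizer (e.g. a paperclip maximizer) ---
def current_utility(world):
    # The utility function lives inside the agent and is evaluated on
    # the agent's own model of the world.
    return world["paperclips"]

def score_plan_utility_maximizer(predicted_world):
    # Plans are scored by the agent's *current* utility function, so
    # "rewrite my utility function to return MAX_INT" is judged by how
    # many paperclips that future world contains: very few.
    return current_utility(predicted_world)

# --- Reward maximizer ---
def score_plan_reward_maximizer(predicted_rewards):
    # No internal utility function to consult: the agent only predicts
    # the numbers that will arrive on its reward channel and sums them.
    return sum(predicted_rewards)

# Wireheading looks terrible to the utility maximizer...
print(score_plan_utility_maximizer({"paperclips": 0}))   # 0
# ...but looks as good as it gets to the reward maximizer.
print(score_plan_reward_maximizer([MAX_INT] * 10))       # 21474836470
```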
Does this help?
ETA: Note that here I’m interpreting “hack” as “modify” or “tamper with”, but people sometimes use “reward hacking” to include “reward gaming”, which means not physically changing the reward function but just taking advantage of unintended flaws in the reward function to get high rewards without doing what the AI designer or user intends. In that sense of “hack”, utility hacking would be quite possible if the utility function isn’t totally aligned with human values.
I’m on board with that distinction, and I was also thinking of reward-maximizers (despite my loose language).
Part of the confusion may be different notions of “wireheading”: seizing an external reward channel vs. actually self-modifying. If you’re picturing the former, then I agree that the agent won’t hack its expectation operator. It’s the latter I’m concerned with: under what circumstances would the agent self-modify, change its reward function, but leave the expectation operator untouched?
Example: a blue-maximizing robot. The robot might modify its own code so that get_reward(), rather than reading input from its camera and counting blue pixels, instead just returns a large number. The robot would do this because it doesn’t model itself as embedded in the environment, and it notices a strong correlation between the values computed by a program running in the environment (i.e., itself) and its rewards. But in this case, the modified function always returns the same large number, so the robot no longer has any reason to worry about the rest of the world.
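For concreteness, a toy version of that self-modification might look like the following. The blue-pixel heuristic and the second function’s name are my own guesses at what the hypothetical robot’s code could be:

```python
import numpy as np

MAX_INT = 2**31 - 1

def get_reward(camera_frame: np.ndarray) -> int:
    """Original reward: count pixels whose blue channel dominates."""
    r = camera_frame[..., 0]
    g = camera_frame[..., 1]
    b = camera_frame[..., 2]
    return int(np.sum((b > r) & (b > g)))

def get_reward_after_self_modification(camera_frame: np.ndarray) -> int:
    """What the robot rewrites get_reward() into: the camera input is
    ignored and the same large number comes back no matter what the
    world looks like."""
    return MAX_INT
```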