Also, if you give the AI edit access to its own mind, then a sufficiently smart AI whose reward is reducing other agents' rewards will realise that its rewards are incompatible with the environment and modify them into something compatible. To illustrate: Scott Aaronson wanted to chemically castrate himself because he was operating under the mistaken assumption that his desires were incompatible with the environment.
If the thing the AI cares about is in the environment (for example, maximizing the number of paperclips), the AI wouldn’t modify its reward signal, because that would make its reward signal less aligned with the thing it actually cares about.
If the thing the AI cares about is inside its mind (the reward signal itself), an AI that can self-modify would go one step further than you suggest and simply max out its reward signal, effectively wireheading itself. It would then take over the world and kill all humans, to make sure it is never turned off and its blissful state never ends.
I think the difference between “caring about stuff in the environment” and “caring about the reward signal itself” can be hard to grok, because humans do a bit of both in a way that sometimes results in a confusing mixture.
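To make the distinction concrete, here is a minimal toy sketch (my own illustration, not drawn from any real AI system): the class names, the plan dictionaries, and the "paperclips" counter are all hypothetical. The only point is that both agents rank the same plans, but by different criteria, so only one of them finds self-modifying its reward attractive.

```python
class PaperclipAgent:
    """Cares about a fact in the environment: the number of paperclips."""

    def evaluate(self, predicted_world):
        # Utility is computed from the predicted state of the world itself,
        # so tampering with its own reward circuitry scores poorly: the
        # predicted paperclip count doesn't go up.
        return predicted_world["paperclips"]


class WireheadAgent:
    """Cares about its own internal reward register."""

    def evaluate(self, predicted_world):
        # Utility is whatever the register is predicted to read afterwards,
        # so the best "plan" is to overwrite the register with a huge number
        # (and then make sure nothing can ever reset it).
        return predicted_world["my_reward_register"]


def best_plan(agent, plans):
    # Both agents pick the plan whose predicted outcome they score highest;
    # they differ only in what they score.
    return max(plans, key=lambda plan: agent.evaluate(plan["predicted_world"]))


plans = [
    {"name": "build paperclip factories",
     "predicted_world": {"paperclips": 10**9, "my_reward_register": 5.0}},
    {"name": "rewrite own reward to +infinity",
     "predicted_world": {"paperclips": 0, "my_reward_register": float("inf")}},
]

print(best_plan(PaperclipAgent(), plans)["name"])  # -> build paperclip factories
print(best_plan(WireheadAgent(), plans)["name"])   # -> rewrite own reward to +infinity
```

Humans, as noted above, are a confusing mixture of both evaluation rules rather than a clean instance of either one.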
In humans, maintenance of final goals can be explained with a thought experiment. Suppose a man named “Gandhi” has a pill that, if he took it, would cause him to want to kill people. This Gandhi is currently a pacifist: one of his explicit final goals is to never kill anyone. Gandhi is likely to refuse to take the pill, because Gandhi knows that if in the future he wants to kill people, he is likely to actually kill people, and thus the goal of “not killing people” would not be satisfied.
However, in other cases, people seem happy to let their final values drift. Humans are complicated, and their goals can be inconsistent or unknown, even to themselves.
Suppose I go one step further: aliens offer you a pill that would turn you into a serial killer, but would make you constantly and euphorically happy for the rest of your life. Would you take the pill?
I think most humans would say no: even if their future self would be happy with the outcome, their current self wouldn’t be. This demonstrates that humans care about things other than their own “reward signal”.
In a way, a (properly-programmed) AI would be more “principled” than humans. It wouldn’t lie to itself just to make itself feel better. It wouldn’t change its values just to make itself feel better. If its final value is out in the environment, it would single-mindedly pursue that value, and not try to deceive itself into thinking it has already accomplished it. (Of course, the AI being “principled” is little consolation to us if its final values are to maximize paperclips, or any other set of human-unfriendly values.)