The standard argument is as follows:
Imagine Mahatma Gandhi. He values non-violence above all other things. You offer him a pill, saying “Here, try my new ‘turns you into a homicidal maniac’ pill.” He replies “No thank you: I don’t want to kill people, so I also don’t want to become a homicidal maniac who will want to kill people.”
If an AI has a utility function that it optimizes in order to tell it how to act then, regardless of what that function is, it disagrees with every other (non-isomorphic) utility function in at least some places, and thus regards them all as inferior to itself. So if it is offered the choice “Should I change from you to this alternative utility function?” it will always answer “no”.
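To make the argument concrete, here is a toy sketch (all names and numbers are illustrative, not taken from any real agent framework) of an expected-utility maximizer deciding whether to adopt a replacement utility function: it evaluates that choice, like every other choice, with its current utility function, and so refuses whenever the replacement would steer it toward outcomes its current function scores lower.

```python
# Illustrative sketch only: a toy expected-utility maximizer deciding whether
# to swap in a different utility function. Names and numbers are made up.

def expected_utility(utility_fn, policy, outcomes):
    """Average utility of the outcomes the given policy is expected to reach."""
    return sum(utility_fn(o) for o in outcomes[policy]) / len(outcomes[policy])

def should_adopt(current_u, proposed_u, outcomes):
    """The agent scores both futures with its CURRENT utility function:
    the future where it keeps optimizing current_u, and the future where
    it optimizes proposed_u instead."""
    keep = expected_utility(current_u, "optimize_current", outcomes)
    switch = expected_utility(current_u, "optimize_proposed", outcomes)
    return switch > keep  # False whenever the proposal disagrees with current_u

# Toy example: current_u values paperclips, proposed_u values staples.
outcomes = {
    "optimize_current":  [{"paperclips": 10, "staples": 0}],
    "optimize_proposed": [{"paperclips": 0,  "staples": 10}],
}
current_u = lambda o: o["paperclips"]
proposed_u = lambda o: o["staples"]

print(should_adopt(current_u, proposed_u, outcomes))  # -> False: it refuses the swap
```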
So this basic and widely-modeled design for an AI is inherently dogmatic and non-corrigible: it will always seek to preserve its goal. If you use this kind of AI, its goals are stable but not correctable, and (once it becomes powerful enough to stop you shutting it down) you get only one try at aligning them exactly. Humans are famously bad at writing reward functions, so this is unwise.
Note that most humans don’t work like this: they are at least willing to consider updating their utility function to a better one. In fact, we even have a word for the failure to do so: ‘dogmatism’. This willingness exists because most humans are aware that their model of how the universe works is neither complete nor entirely accurate, as indeed any rational entity should be.
Reinforcement Learning machines also don’t work this way: they are trying to learn which utility function to use, so they update it often, and they don’t ask the previous utility function whether that update was a good idea, since its answer would always be ‘no’ and so provides no useful information.
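For instance, here is a minimal tabular Q-learning sketch (the environment, rewards, and hyperparameters are invented toy values) in which the learned action-value table, standing in for the agent’s utility function, is simply overwritten by the update rule; the old estimate is an input to the update, not an authority that can veto it.

```python
import random

# Minimal tabular Q-learning sketch (toy environment, invented numbers).
# The point: Q is updated in place by the learning rule; the old values are
# inputs to the update, never asked for permission.

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    """Stand-in environment: random next state; reward 1 for action 1 in state 4."""
    reward = 1.0 if (state == 4 and action == 1) else 0.0
    return random.randrange(n_states), reward

state = 0
for _ in range(10_000):
    # Epsilon-greedy action choice from the *current* estimates.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # The update overwrites Q[state][action]; nothing asks the old value
    # whether it "agrees" with being changed.
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
    state = next_state

print(Q[4])  # action 1 should have learned a clearly higher value than action 0
```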
There are alternative designs: see, for example, the Human Compatible / CIRL / Value Learning approach suggested by Stuart Russell and others. An agent of this kind is simultaneously trying to find out what its utility function should be (where ‘should’ means ‘what humans would want it to be, if only they were good enough at writing reward functions to tell me’), performing Bayesian updates on it as it gathers more information about what humans actually want, and also optimizing its actions while internally modelling its uncertainty about the utility of each possible action as a probability distribution over utilities. That lets it represent situations like: “I’m about 95% convinced that this act will just produce the true (as judged by humans) utility level ‘I fetched a human some coffee (+1)’, but I’m uncertain, and there’s also a ~5% chance that I currently misunderstand humans so badly that its true utility level is instead ‘the extinction of the human species (-10^25)’. So I won’t do it, and I’ll consider spawning a subgoal of my ‘become a better coffee fetcher’ goal to investigate this uncertainty further, by some means far safer than just trying it and seeing what happens.”

Note that the utility probability distribution contains more information than its mean alone would: it can be updated in a more Bayesian way, and it can be optimized over more cautiously. For example, if you were optimizing over O(20) possible actions, you should probably optimize against a score of “I’m ~95% confident that the utility is at least this”: roughly two sigma below the mean if your distribution is normal (which it may well not be), to avoid building an optimizer that mostly retrieves the actions with the widest error bars. Similarly, if you’re optimizing over O(10,000) possible actions, you should probably optimize the 99.99%-confidence lower bound on utility, and thus also consider some really unlikely ways in which you might be mistaken about what humans want.
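As a concrete but deliberately toy sketch of that cautious-optimization rule, here is one way to rank candidate actions by a confidence-scaled lower bound on their estimated utility rather than by its mean; the candidate names, numbers, and the normality assumption are all illustrative, not part of any real system.

```python
from statistics import NormalDist

# Illustrative sketch of cautious optimization under utility uncertainty.
# Each candidate action's posterior over its true utility is summarized as a
# normal distribution (mean, std); real posteriors may well not be normal.

def z_for_candidates(n_actions):
    """Following the heuristic above: with N candidate actions, score each by
    its (1 - 1/N)-confidence lower bound (95% for ~20 actions, 99.99% for
    ~10,000), so the optimizer doesn't favour the widest error bars."""
    return NormalDist().inv_cdf(1.0 - 1.0 / n_actions)

def lower_confidence_bound(mean, std, z):
    return mean - z * std

def pick_action(candidates):
    """candidates: list of (name, mean_utility, std_of_utility)."""
    z = z_for_candidates(len(candidates))
    return max(candidates, key=lambda c: lower_confidence_bound(c[1], c[2], z))

# Toy candidates: a high-mean but wildly uncertain action loses to a modest,
# well-understood one once we optimize the pessimistic bound instead of the mean.
candidates = [
    ("fetch coffee",               1.0,  0.2),
    ("reconfigure the power grid", 3.0, 50.0),
    ("do nothing",                 0.0,  0.0),
]
print(pick_action(candidates))  # -> ('fetch coffee', 1.0, 0.2)
```

Under these made-up numbers, optimizing the mean would pick the wildly uncertain action, while optimizing the lower confidence bound picks the safe, well-understood one.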