You’re assuming the agent would have a goal (“being in line with my intended purpose”) that is not actually part of its goals.
I doubt that he’s assuming that.
To highlight the problem, imagine an intelligent being that wants to correctly interpret and follow the interpretation of an instruction written down on a piece of paper in English.
Now the question is, what is this being’s terminal goal? Here are some possibilities:
(1) The correct interpretation of the English instruction.
(2) Correctly interpreting and following the English instruction.
(3) The correct interpretation of 2.
(4) Correctly interpreting and following 2.
(5) The correct interpretation of 4.
(6) …
Each of the possibilities is one level below its predecessor. In other words, possibility 1 depends on 2, which in turn depends on 3, and so on.
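To make the regress concrete, here is a minimal sketch (my own illustration, not part of the original argument) that spells the hierarchy out as a recursive definition; the only point is that expanding any level keeps referring to a lower level, so the description never bottoms out by itself:

```python
# Hypothetical sketch of the goal hierarchy above as a recursive definition.
# The function and its output format are illustrative only.

def goal(n: int) -> str:
    """Spell out level n of the hierarchy in words."""
    if n == 1:
        return "the correct interpretation of the English instruction"
    if n == 2:
        return "correctly interpreting and following the English instruction"
    if n % 2 == 1:
        # Odd levels: the correct interpretation of the level just below.
        return f"the correct interpretation of [{goal(n - 1)}]"
    # Even levels: correctly interpreting and following the even level two below.
    return f"correctly interpreting and following [{goal(n - 2)}]"

for n in range(1, 7):
    print(f"({n}) {goal(n)}")
```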
The premise is that you are in possession of an intelligent agent that you are asking to do something. The assumption made by AI risk advocates is that this agent would interpret any instruction in some perverse manner. The counterargument is that this contradicts the assumption that this agent was supposed to be intelligent in the first place.
Now the response to this counterargument is to climb down the assumed hierarchy of hard-coded instructions and to claim that without some level N, which supposedly is the true terminal goal underlying all behavior, the AI will just optimize for the perverse interpretation.
Yes, the AI is a deterministic machine. Nobody doubts this. But the given response also works against the perverse interpretation. To see why, first note that if the AI is capable of self-improvement and of taking over the world, then it is, hypothetically, also capable of arriving at an interpretation that is as good as one a human being would be capable of arriving at. Since, by definition, the AI has this capability, it will either use it selectively or universally.
The question here becomes why the AI would selectively abandon this capability when it comes to interpreting the highest level instructions. In other words, without some underlying level N, without some terminal goal which causes the AI to adopt a perverse interpretation, the AI would use its intelligence to interpret the highest level goal correctly.