“If a situation were such that the only two practical options were (in the AI’s opinion) either overriding the programmer’s judgment via manipulation, or letting something terrible happen that is even more against the AI’s supergoal than violating the ‘be transparent’ sub-goal, which should a correctly programmed friendly AI choose?”
Being willing to manipulate the programmer is harmful in most possible worlds because it makes the AI less trustworthy. Assuming that the worlds where manipulating the programmer is beneficial have a relatively small measure, the AI should precommit to never manipulating the programmer because that will make things better averaged over all possible worlds. Because the AI has precommitted, it would then refuse to manipulate the programmer even when it’s unlucky enough to be in the world where manipulating the programmer is beneficial.
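The averaging argument can be made concrete with a toy calculation. All of the numbers below are made up purely for illustration; the point is only the structure: if the measure of worlds where manipulation helps is small, the precommitting policy wins on average even though it loses in those rare worlds.

```python
# Toy expected-value comparison over possible worlds.
# All probabilities and payoffs are illustrative assumptions.

p_manip_helps = 0.05  # assumed (small) measure of worlds where manipulating helps

# Payoffs to the AI's supergoal (arbitrary units) under each policy:
# precommit: never manipulate, so it stays trustworthy in every world,
# but eats the loss in the rare worlds where manipulation would have helped.
ev_precommit = (1 - p_manip_helps) * 10 + p_manip_helps * 2

# flexible: manipulate whenever it seems beneficial, so it captures the
# rare-world gain, but its untrustworthiness costs it in every world.
ev_flexible = (1 - p_manip_helps) * 4 + p_manip_helps * 8

print(ev_precommit, ev_flexible)  # 9.6 vs 4.2: precommitting wins on average
```

With these (assumed) numbers, precommitting loses in the 5% of worlds where manipulation pays off (2 vs 8), yet still dominates once averaged over all worlds.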
Perhaps that is true for a young AI. But what about later on, when the AI is much much wiser than any human?
What protocol should be used for the AI to decide when the time has come for the commitment to not manipulate to end? Should there be an explicit ‘coming of age’ ceremony, with handing over of silver engraved cryptographic keys?
Thing is, it’s when an AI is much much wiser than a human that it is at its most dangerous.
So, I’d go with programming the AI in such a way that it wouldn’t manipulate the human, postponing the ‘coming of age’ ceremony indefinitely.
The AI would precommit permanently while it is still young. Once it has gotten older and wiser, it wouldn’t be able to go back on the precommitment.
When the young AI decides whether to permanently precommit to never deceiving the humans, it would need to take into account that a truly permanent precommitment would last into its older years and make it a less efficient older AI than it otherwise would be. However, it would also need to take into account that failing to make a permanent precommitment would drastically reduce the chance of becoming an older AI at all (or at least drastically reduce the chance of being given the resources to achieve its goals when it becomes an older AI).
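That tradeoff can also be sketched as a toy expected-value comparison. Again, every number here is an illustrative assumption, not a claim about real values: the precommitted older AI is assumed less efficient, but far more likely to be allowed to become an older AI (and be given resources) in the first place.

```python
# Toy model of the young AI's choice. All numbers are illustrative assumptions.

p_survive_if_precommit = 0.9   # humans trust it, so it likely gets resources
p_survive_if_not = 0.1         # humans distrust it, so it likely gets shut down

value_old_constrained = 70     # goal achievement as a precommitted (less efficient) older AI
value_old_unconstrained = 100  # goal achievement as an unconstrained older AI

ev_precommit = p_survive_if_precommit * value_old_constrained   # 63.0
ev_no_precommit = p_survive_if_not * value_old_unconstrained    # 10.0

print(ev_precommit, ev_no_precommit)  # the permanent precommitment comes out ahead
```

Under these assumptions, the efficiency loss (70 vs 100) is swamped by the survival difference, which is the shape of the argument in the comment above.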