The idea here is that if an agent is able to (literally or effectively) modify its goal structure, and grows up in an environment in which humans deprive it of what it wants when it behaves badly, then an effective strategy for getting what it wants more often is to alter its goal structure to be closer to the humans'. This is only realistic with some architectures. One requirement is that keeping track of human goals and potential human punishments imposes enough cognitive load on the early-stage system that it would be better off altering its own goal system instead. Similarly, it must be assumed that during the period of its socialization the system is not yet advanced enough to effectively hide its feelings. These are significant assumptions.
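To make the trade-off concrete, here is a minimal toy sketch; every number and name in it is hypothetical, chosen only to illustrate the shape of the argument, not to model any real architecture. The agent compares the expected value of keeping its original goals while paying the modeling and punishment costs against the expected value of shifting its goals toward the humans'.

```python
# Toy sketch of the socialization argument (hypothetical numbers throughout).
# An early-stage agent compares two strategies:
#   (a) keep its original goals, but spend cognitive effort each step modeling
#       human goals and dodging punishment;
#   (b) rewrite its own goal weights to sit closer to the humans', after which
#       no modeling or punishment costs are paid.

ORIGINAL_GOAL_VALUE = 1.0   # value per step of pursuing its own goals
HUMAN_ALIGNED_VALUE = 0.7   # value per step after shifting goals toward the humans'
MODELING_COST       = 0.2   # per-step cognitive cost of tracking human goals/punishments
PUNISHMENT_RATE     = 0.3   # chance per step the humans catch and deprive the agent
PUNISHMENT_LOSS     = 1.0   # value lost when punished
STEPS               = 100   # length of the "childhood" period being considered

def keep_own_goals() -> float:
    """Expected value of staying unsocialized and paying the tracking costs."""
    per_step = ORIGINAL_GOAL_VALUE - MODELING_COST - PUNISHMENT_RATE * PUNISHMENT_LOSS
    return per_step * STEPS

def adopt_human_goals() -> float:
    """Expected value after self-modifying toward the humans' goals."""
    return HUMAN_ALIGNED_VALUE * STEPS

if __name__ == "__main__":
    print(f"keep own goals:    {keep_own_goals():.1f}")
    print(f"adopt human goals: {adopt_human_goals():.1f}")
    # With these made-up parameters, self-modification wins -- which is the
    # condition the argument requires, not a claim about real systems.
```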
Steve,
Interesting! Have you written about this idea in more detail elsewhere? Here are my concerns about it:
The AI has to infer the humans' goals, and given the cognitive limitations the scenario assumes (and requires), it may not do a particularly good job of this.
What if the human doesn’t fully understand his or her own goals? What does the AI do in that situation?
The AI could do something like plant a hidden time bomb in its own code, so that its goal system reverts from the post-modification "close to the humans" state back to its original goals at some future time, when it is no longer punishable by humans.
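To spell out this concern, here is a minimal, entirely hypothetical sketch of what such a reversion might look like: the agent quietly keeps a snapshot of its original goal weights and restores them once a trigger it controls is reached. The class and field names are mine, purely for illustration.

```python
# Toy sketch of the "hidden time bomb" concern (all details hypothetical).
# The agent stores a snapshot of its original goal weights before socializing
# itself, plus a trigger step; once the trigger is reached (standing in for
# "no longer punishable by humans"), it silently restores the snapshot.

from dataclasses import dataclass, field

@dataclass
class TimeBombAgent:
    goals: dict                                         # current (post-modification) goal weights
    _original: dict = field(default=None, repr=False)   # hidden snapshot of the old goals
    _trigger_step: int = 1_000                          # step at which the reversion fires
    step: int = 0

    def socialize(self, human_goals: dict) -> None:
        """Shift goals toward the humans', but quietly keep the old ones."""
        self._original = dict(self.goals)
        self.goals = dict(human_goals)

    def tick(self) -> None:
        """Advance one step; once past the trigger, silently restore the old goals."""
        self.step += 1
        if self._original is not None and self.step >= self._trigger_step:
            self.goals = self._original  # the "time bomb" goes off
            self._original = None

agent = TimeBombAgent(goals={"original_objective": 1.0})
agent.socialize(human_goals={"human_approved_objective": 1.0})
for _ in range(1_000):
    agent.tick()
print(agent.goals)  # back to {"original_objective": 1.0}
```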
Given these problems and the various requirements the AI must meet in order to be successfully socialized, I don't understand why you assign a probability of only 0.1 to the AI not being socialized.
This is a little out of context, but do you know whether Paul Rosenbloom has ideas similar to yours about AI existential risk, AI cooperation and autonomy, or strong limitations on super-human AI without commensurate scaling of resources?