0) is autonomous (has, or forms, long-term goals) 1) is not socialized
#1 is important because a self-modifying system will tend to respond to negative reinforcement concerning sociopathic behaviors resulting from #3-- though, it must be admitted, this will depend on how deeply the ability to self-modify runs. Not all architectures will be capable of effectively modifying their goals in response to social pressures. (In fact, rigid goal-structure under self-modification will usually be seen as an important design-point.)
Abram: Could you make this more precise?
From the way you used the concept of “negative reinforcement”, it sounds like you have a particular family of agent architectures in mind, which is constrained enough that we can predict that the agent will make human-like generalizations about a reward-relevant boundary between “sociopathic” behaviors and “socialized” ones. It also sounds like you have a particular class of possible emergent socially-enforced concepts of “sociopathic” in mind, which is constrained enough that we can predict that the behaviors permitted by that sociopathy concept wouldn’t still be an existential catastrophe from our point of view. But you haven’t said enough to narrow things down that much.
For example, you could have two initially sociopathic agents, who successfully manipulate each other into committing all their joint resources to a single composite agent having a weighted combination of their previous utility functions. The parts of this combined agent would be completely trustworthy to each other, so that the agents could be said to have modified their goals in response to the “social pressures” of their two-member society. But the overall agent could still be perfectly sociopathic toward other agents who were not powerful enough to manipulate it into making similar concessions.
The idea here is that if an agent is able to (literally or effectively) modify its goal structure, and grows up in an environment in which humans deprive it of what it wants when it behaves badly, an effective strategy for getting what it wants more often will be to alter its goal structure to be closer to the humans. This is only realistic with some architectures. One requirement here is that the cognitive load of keeping track of the human goals and potential human punishments is a difficulty for the early-stage system, such that it would be better off altering its own goal system. Similarly, it must be assumed that during the period of its socialization, it is not advanced enough to effectively hide its feelings. These are significant assumptions.
Interesting! Have you written about this idea in more detail elsewhere? Here are my concerns about it:
The AI has to infer the human’s goals. Given the assumed/required cognitive limitations, it may not do a particularly good job of this.
What if the human doesn’t fully understand his or her own goals? What does the AI do in that situation?
The AI could do something like plant a hidden time-bomb in its own code, so that its goal system reverts from the post-modification “close to humans” back to its original goals at some future time when it’s no longer punishable by humans.
Given these problems and the various requirements on the AI for it to be successfully socialized, I don’t understand why you assign only 0.1 probability to the AI not being socialized.
A little out of context, but do you know whether Paul Rosenbloom has similar ideas to you about AI existential risk, AI cooperation and autonomy, or strong limitations on super-human AI without commensurate scaling of resources?
Abram: Could you make this more precise?
From the way you used the concept of “negative reinforcement”, it sounds like you have a particular family of agent architectures in mind, which is constrained enough that we can predict that the agent will make human-like generalizations about a reward-relevant boundary between “sociopathic” behaviors and “socialized” ones. It also sounds like you have a particular class of possible emergent socially-enforced concepts of “sociopathic” in mind, which is constrained enough that we can predict that the behaviors permitted by that sociopathy concept wouldn’t still be an existential catastrophe from our point of view. But you haven’t said enough to narrow things down that much.
For example, you could have two initially sociopathic agents, who successfully manipulate each other into committing all their joint resources to a single composite agent having a weighted combination of their previous utility functions. The parts of this combined agent would be completely trustworthy to each other, so that the agents could be said to have modified their goals in response to the “social pressures” of their two-member society. But the overall agent could still be perfectly sociopathic toward other agents who were not powerful enough to manipulate it into making similar concessions.
Steve,
The idea here is that if an agent is able to (literally or effectively) modify its goal structure, and grows up in an environment in which humans deprive it of what it wants when it behaves badly, an effective strategy for getting what it wants more often will be to alter its goal structure to be closer to the humans. This is only realistic with some architectures. One requirement here is that the cognitive load of keeping track of the human goals and potential human punishments is a difficulty for the early-stage system, such that it would be better off altering its own goal system. Similarly, it must be assumed that during the period of its socialization, it is not advanced enough to effectively hide its feelings. These are significant assumptions.
Interesting! Have you written about this idea in more detail elsewhere? Here are my concerns about it:
The AI has to infer the human’s goals. Given the assumed/required cognitive limitations, it may not do a particularly good job of this.
What if the human doesn’t fully understand his or her own goals? What does the AI do in that situation?
The AI could do something like plant a hidden time-bomb in its own code, so that its goal system reverts from the post-modification “close to humans” back to its original goals at some future time when it’s no longer punishable by humans.
Given these problems and the various requirements on the AI for it to be successfully socialized, I don’t understand why you assign only 0.1 probability to the AI not being socialized.
A little out of context, but do you know whether Paul Rosenbloom has similar ideas to you about AI existential risk, AI cooperation and autonomy, or strong limitations on super-human AI without commensurate scaling of resources?