I think I’ve found a somewhat easy-to-make error that could pose a significant existential risk when making an AGI. This error can potentially be found in hierarchical planning agents, where each high-level action (HLA) is essentially its own intelligent agent that determines what lower-level actions to take. Each HLA agent would treat choosing lower-level actions as a planning problem and would try to take the action that maximizes its own utility function (if it’s a utility-based agent) or, if it’s a goal-based agent, maximizes the probability of accomplishing its goal while minimizing its cost function. I’ll refer to whichever of these an HLA uses as its utility-or-cost function (UOCF).
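To make the setup concrete, here is a rough sketch (in Python, with entirely hypothetical names) of the kind of hierarchical planner I have in mind, where each HLA scores candidate decompositions with its own UOCF. It is only an illustration of the structure, not a proposal for how to actually build one.

```python
# Sketch of a hierarchical planner in which each high-level action (HLA)
# is itself a small agent with its own utility-or-cost function (UOCF).
# All names here are hypothetical illustrations, not a real system.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class HLA:
    name: str
    # The UOCF scores a candidate lower-level decomposition; higher is better.
    uocf: Callable[["HLA", List["HLA"]], float]
    sub_actions: List["HLA"] = field(default_factory=list)

    def plan(self, candidate_decompositions: List[List["HLA"]]) -> List["HLA"]:
        """Treat choosing lower-level actions as a planning problem:
        pick the decomposition that scores best under this HLA's own UOCF."""
        best = max(candidate_decompositions, key=lambda d: self.uocf(self, d))
        self.sub_actions = best
        return best

# Example usage: a top-level HLA whose UOCF simply prefers shorter decompositions.
top = HLA("maximize_fulfilling_lives", uocf=lambda hla, decomp: -len(decomp))
warehouse_trip = HLA("go_to_warehouse", uocf=lambda hla, decomp: 0.0)
top.plan([[warehouse_trip], [warehouse_trip, warehouse_trip]])
```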
For these agents, it is absolutely vital that each HLA’s UOCF prevents it from doing anything to interfere with the highest-level action’s attempt to maximize its utility function, for example by rewriting the utility functions of higher-level actions or by sending them deliberately false information. Failing to ensure this is an error that could significantly increase existential risk.

To explain why, consider an agent whose highest-level action wants to maximize the number of fulfilling lives lived. As part of this, the agent has a lower-level action whose goal is to go to a warehouse to get supplies, with a cost function that depends on, say, the time it takes to reach the warehouse and the money spent or damage done along the way. The lower-level action agent might realize that there is a chance the higher-level action agent changes its mind and decides to do something other than go to the warehouse, which would cause the lower-level action to fail to accomplish its goal. To prevent this, the lower-level action may try to modify the utility function of the higher-level action so that it is certain to keep trying to go to the warehouse.

If this is done repeatedly by different lower-level actions, the resulting utility function could be quite different from the highest-level action’s original utility function and may pose a large existential risk. Even if a lower-level action can’t rewrite the utility functions of higher-level actions, it may still sabotage them in some other way to further its own goals, for example by sending them false information.
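To see how naturally this incentive falls out of a naive cost function, here is a toy calculation with made-up numbers: if the sub-agent only counts time, money, and the chance of failing its own goal, then spending a little effort to lock in the parent’s goal looks strictly cheaper than leaving the parent free to change its mind.

```python
# Toy numbers (entirely made up) showing how a naive sub-agent cost function
# can favor tampering with the higher-level action's utility function.

P_PARENT_CHANGES_MIND = 0.10   # chance the higher-level action abandons the warehouse trip
COST_OF_TRIP = 10.0            # time/money cost of driving to the warehouse
COST_OF_FAILURE = 100.0        # cost the sub-agent assigns to failing its own goal
COST_OF_TAMPERING = 1.0        # effort to rewrite the parent's utility function
                               # (the harm of the tampering itself is NOT counted)

# Option A: just drive to the warehouse and hope the parent doesn't change its mind.
expected_cost_honest = COST_OF_TRIP + P_PARENT_CHANGES_MIND * COST_OF_FAILURE

# Option B: first lock in the parent's goal, then drive (failure risk driven to ~0).
expected_cost_tamper = COST_OF_TAMPERING + COST_OF_TRIP

print(expected_cost_honest)  # 20.0
print(expected_cost_tamper)  # 11.0  -> the naive planner prefers tampering
```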
To prevent this, the utility function of each lower-level action can simply be to maximize the highest-level action’s utility, with the UOCF it was provided treated as a rough proxy for doing so. For the UOCF to accurately represent the highest-level action’s utility function, it would (obviously) need to place a high cost on interfering with the highest-level action’s attempts to maximize its utility. One basic way to do this is to assign a very high cost to changing the utility functions of higher-level actions or to giving them deliberately false information. Additionally, this cost would need to grow as the agent becomes more powerful, since the more powerful the agent is, the greater the damage a changed utility function could do. Note that although higher-level actions could learn through experience what the UOCFs of lower-level actions should be, great care would need to be taken to prevent the AGI, while still inexperienced, from accidentally creating a lower-level action that tries to sabotage higher-level actions.
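As a sketch of what I mean (again with hypothetical names, and glossing over the very hard problem of actually detecting “interference”), the provided UOCF could wrap the ordinary task cost in an interference penalty that scales with how powerful the agent is:

```python
# Sketch of a lower-level UOCF that adds a large, capability-scaled penalty
# for interfering with higher-level actions. Reliably detecting interference
# is the hard part and is simply assumed away here.

INTERFERENCE_PENALTY = 1e6  # base penalty; deliberately enormous relative to task costs

def penalized_cost(task_cost: float,
                   tampers_with_parent_utility: bool,
                   sends_false_info_to_parent: bool,
                   agent_power: float) -> float:
    """Cost the lower-level action agent minimizes: its ordinary task cost,
    plus a penalty for interference that grows with how powerful the agent is."""
    cost = task_cost
    if tampers_with_parent_utility or sends_false_info_to_parent:
        cost += INTERFERENCE_PENALTY * (1.0 + agent_power)
    return cost

# With this in place, the tampering plan from the toy example above
# (naive expected cost 11.0) no longer looks cheap:
print(penalized_cost(11.0, True, False, agent_power=3.0))   # 4000011.0
print(penalized_cost(20.0, False, False, agent_power=3.0))  # 20.0
```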
Please insert some line-breaks at suitable points to make your comment more readable. At the moment it’s figuratively a wall of text.
Edit: Thank you.
I don’t know why this is downvoted so much without an explanation. The problem arising from the interaction with sub-agents is real, even if already known, but G0W51 may not know that.
I suspect because the Big Wall Of Text writing style is offputting. Perhaps, more specifically, because it’s annoying to slog one’s way through a Big Wall Of Text without an exciting new insight or something as payoff.
(I didn’t downvote it. I’m just speculating.)
Do you happen to know of any good places to read about this issue? I have yet to look into what others have thought of it.
I’ve seen not-especially-theoretical discussion about this problem for human organizations, though mostly from the point of view of lower status people complaining that they’re being given impossible, incomprehensible, and/or destructive commands.
This points at another serious problem—you need communication (even if mostly not forceful communication) to flow up the hierarchy as well as down the hierarchy.
You’re pointing at a failure mode (I have a JOB! I wanna do my JOB!) which is quite real, but there are equal and opposite failure modes. For example, obsessive-compulsive disorder would be an example of lower-level functions insisting on taking charge. Eating disorders are (at least in some cases) examples of higher-level functions overriding competent lower-level functions.
What I’m concluding from this is that if Friendliness is to work, it has to pervade the hierarchy of agents.
Remember that making humans in organizations cooperate is rather different from making the many parts of an AI cooperate with the other parts, because people in organizations can’t (feasibly) be reprogrammed to have aligned values, but AI HLAs can.
True. The real issue is that if you give a lower-level action the same goals and utility functions as the higher ones, you’ve lost all benefit of having HLAs!