Suppose there are only three computable goals: maximizing A, maximizing B, and maximizing R. R is the AUP agent's real goal, and it has a very high computational complexity, so increasing its ability to pursue R is not significantly penalized. The agent has free control over A, but no control over B. The agent wants to maximize R while retaining the ability to maximize A and avoiding the ability to affect B. To do this, it could construct another agent H, which is a very powerful R maximizer, but which would switch to maximizing A if AUP asked nicely.
Think of AUP in a box, with a pencil. H is a world-conquering nanotech super-intelligence. If AUP writes down “make paperclips”, H will calculate how many paperclips AUP could have made in the hypothetical world where H didn’t exist, and then make that many paperclips. AUP therefore has exactly the same ability to make paperclips as it would have if H didn’t exist, so building H neither increases nor decreases AUP’s ability to create paperclips. If the same setup also applies to all other simple goals, then building H is a low-impact action.
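To make the claim concrete, here is a schematic version of an attainable-utility penalty (a simplification for illustration, not the post’s actual formulation; $Q_u$ just stands for the attainable utility of goal $u$ after the action in question):

$$\text{Penalty}(a) \;=\; \sum_{u \in \{A,\,B,\,R\}} w_u \,\bigl|\, Q_u(a) - Q_u(\varnothing) \,\bigr|,$$

where $\varnothing$ is the do-nothing action. If H always hands AUP exactly the attainable utility it would have had in the no-H world, then $Q_u(\text{build } H) = Q_u(\varnothing)$ for every simple goal $u$, every term vanishes, and building H registers as zero impact even though H itself may go on to have an enormous impact.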
This work is clearly an improvement over previous impact measures, but it still has a failure mode.
EDIT: It seems that you already spotted this failure mode.
First, we use uniform weighting in the bounded case (so each utility would have one-third weight). This means that changes in attainable R are indeed significant.
I don’t think I understand the rest of your first paragraph, would you be willing to rephrase if this comment doesn’t answer your objection?
In any case, building a powerful maximizer H which complies with the agent’s demands indeed drastically increases all three attainable utilities. One helpful way of thinking about it is to imagine that each penalty utility is an “alien agent” which takes over at the given time and tries to maximize that goal. This “alien agent” could use the maximizer (or, for penalties evaluated at earlier steps in the plan, redesign the incomplete maximizer) to accomplish its goal to a much greater extent. In general, building powerful anythings is instrumentally convergent, and seemingly penalized at numerous junctures along the way.
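A toy numeric sketch of this disagreement (the numbers and the penalty function are made up for illustration; this is not the actual AUP computation): under the commenter’s assumption, H perfectly mirrors the no-H counterfactual and the penalty is zero, whereas under the alien-agent view every attainable utility jumps once the (possibly half-built) maximizer is around to be commandeered, and the penalty is large.

```python
# Hypothetical toy model of an attainable-utility penalty; not the real AUP formula.

def penalty(q_after_action, q_after_noop, weights):
    """Weighted absolute change in attainable utility across the penalty set."""
    return sum(w * abs(q_after_action[u] - q_after_noop[u]) for u, w in weights.items())

weights = {"A": 1 / 3, "B": 1 / 3, "R": 1 / 3}  # uniform weighting in the bounded case

# Attainable utilities one step after doing nothing (made-up numbers).
q_noop = {"A": 5.0, "B": 1.0, "R": 5.0}

# Commenter's picture: H grants exactly the no-H counterfactual, so nothing changes.
q_build_h_mirrored = {"A": 5.0, "B": 1.0, "R": 5.0}
print(penalty(q_build_h_mirrored, q_noop, weights))  # 0.0

# Alien-agent picture: whichever penalty goal takes over can repurpose the maximizer,
# so every attainable utility (including the forbidden B) is much higher.
q_build_h_commandeered = {"A": 50.0, "B": 50.0, "R": 50.0}
print(penalty(q_build_h_commandeered, q_noop, weights))  # ~46.3, a large penalty
```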
Even if this weren’t true, how would this proposed workaround pass intent verification?
Edit: Leaving up in case this is useful for anyone else.
In case this wasn’t clear, the powerful agent H isn’t just following instructions; it’s actively trying to minimize the effect of its existence on the impact measure. Agent H is very powerful, and it sometimes delegates to AUP in such a way that AUP is exactly as powerful as it would be without H. If AUP had no way of making widgets on its own, H won’t make widgets for it. The hypothetical alien agent that takes over AUP finds that it can’t have much impact, because H isn’t listening to AUP very much.
AUP starts off not being able to blow up the moon, and blowing up the moon would have a large effect on many utility functions, so this impact measure stops AUP from building a giant moon bomb. However, AUP can build H, and H can build giant moon bombs, so long as H keeps the red button away from AUP. AUP is still unable to blow up the moon, but the moon bombs now exist. AUP will not seek power, but it will build power-seeking agents, as long as those agents won’t share the power.
AUP is more accurately described as a power-conserving agent: it doesn’t want more power, or less.
This wouldn’t pass intent verification. Also, I dispute the claims that building this super-powerful agent is not instrumentally convergent along the way, that it would neither be subverted by nor subvert the alien agent (either is a large impact), and that it wouldn’t cause approval penalties… Also, it’s inefficient indirection on AUP’s part (building H requires extra resources, but doesn’t accomplish any more than if the agent just did the work itself).