First, we use uniform weighting in the bounded case (so each utility would have one-third weight). This means that changes in attainable R are indeed significant.
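To make the weighting concrete, here is a rough Python sketch of the penalty term with three auxiliary utilities at uniform one-third weight; `attainable` and the other names are placeholders for illustration, not the actual implementation.

```python
# Rough sketch of the penalty term under uniform weighting (illustrative only).
# `attainable(u, state)` stands in for however the agent estimates how much of
# auxiliary utility u it could still attain from `state` (e.g., a Q-value).

def aup_penalty(attainable, utilities, state_after_action, state_after_noop):
    """Uniformly weighted absolute change in attainable utility vs. doing nothing."""
    weight = 1.0 / len(utilities)  # with three utilities, one-third weight each
    return sum(
        weight * abs(attainable(u, state_after_action) - attainable(u, state_after_noop))
        for u in utilities
    )
```

With uniform weights, a noticeable shift in even one attainable utility moves the total penalty by a third of that shift.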
I don’t think I understand the rest of your first paragraph; would you be willing to rephrase it if this comment doesn’t answer your objection?
In any case, building a powerful maximizer H which complies with the agent’s demands indeed drastically increases all three attainable utilities. One helpful way of thinking about it is to imagine that each penalty utility is an “alien agent” which takes over at the given time and tries to maximize that goal. This “alien agent” could use the maximizer (or, for penalties evaluated at earlier steps in the plan, redesign the incomplete maximizer) to accomplish its goal to a much greater extent. In general, building anything powerful is instrumentally convergent, and seemingly penalized at numerous junctures along the way.
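As a toy illustration (the numbers below are invented), this is the comparison the penalty makes: if building the compliant maximizer H lets each “alien agent” attain much more of its utility than a no-op would, every term in the penalty is large.

```python
# Invented toy numbers: estimated attainable utility for each of the three
# penalty utilities after a no-op vs. after building the compliant maximizer H.
attainable_after_noop = {"u1": 0.2, "u2": 0.3, "u3": 0.1}
attainable_after_building_H = {"u1": 0.9, "u2": 0.8, "u3": 0.7}

penalty = sum(
    (1.0 / 3.0) * abs(attainable_after_building_H[u] - attainable_after_noop[u])
    for u in ("u1", "u2", "u3")
)
print(penalty)  # ~0.6, large relative to the scale of these utilities
```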
Even if this weren’t true, how would this proposed workaround pass intent verification?
Edit: Leaving this up in case it’s useful for anyone else.
In case this wasn’t clear, the powerful agent H isn’t just following instructions; it’s actively trying to minimize the effect of its existence on the impact measure. Agent H is very powerful, and it sometimes delegates to AUP in such a way that AUP is exactly as powerful as it would be without H. If AUP has no way of making widgets on its own, H won’t make widgets for it. The hypothetical alien agent that takes over AUP finds that it can’t have much impact, because H isn’t listening to AUP very much.
AUP starts off not being able to blow up the moon, and blowing up the moon would have a large effect on many utility functions, so this impact measure stops AUP from building a giant moon bomb. However, AUP can build H, and H can build giant moon bombs, so long as H keeps the red button away from AUP. AUP is still unable to blow up the moon, but the moon bombs exist. AUP will not seek power, but it will build power-seeking agents, if those agents won’t share the power.
AUP can be more accurately described as a power-conserving agent: it doesn’t want more power, or less.
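To spell out the worry with toy numbers (again invented for illustration): if H arranges things so that AUP’s own attainable utilities are exactly what they would have been anyway, the penalty sees no change at all, even though H itself is doing drastic things.

```python
# Invented toy numbers for the objection: AUP's attainable utilities are the
# same whether or not it builds H, because H never shares its power with AUP.
attainable_after_noop = {"u1": 0.2, "u2": 0.3, "u3": 0.1}
attainable_after_building_H = {"u1": 0.2, "u2": 0.3, "u3": 0.1}  # held fixed by H's design

penalty = sum(
    (1.0 / 3.0) * abs(attainable_after_building_H[u] - attainable_after_noop[u])
    for u in ("u1", "u2", "u3")
)
print(penalty)  # 0.0: no penalty, even though H can now build moon bombs
```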
This wouldn’t pass intent verification. Also, I dispute the claims that building this super-powerful agent isn’t instrumentally convergent along the way, that it wouldn’t either be subverted by (or subvert) the alien agent (either is a large impact), and that it wouldn’t cause approval penalties… Also, it’s inefficient indirection on AUP’s part (since building this requires extra resources, but doesn’t get any more accomplished than if the agent just did the work itself).