In case this wasn’t clear, the powerful agent H isn’t just following instructions; it’s actively trying to minimize the effect of its existence on the impact measure. Agent H is very powerful, and it sometimes delegates to AUP in such a way that AUP is exactly as powerful as it would be without H. If AUP had no way of making widgets on its own, H won’t make widgets for it. The hypothetical Alien Agent that takes over AUP finds that it can’t have much impact, because H isn’t listening to AUP very much.
AUP starts off unable to blow up the moon, and blowing up the moon would have a large effect on many utility functions, so the impact measure stops AUP from building a giant moon bomb. However, AUP can build H, and H can build giant moon bombs, so long as H keeps the red button away from AUP. AUP itself remains unable to blow up the moon, yet the moon bombs exist. AUP will not seek power, but it will build power-seeking agents, provided those agents won’t share the power.
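To make the loophole concrete, here is a minimal sketch (with made-up Q-values) of the AUP-style penalty, computed as the total shift in the agent’s own attainable utilities relative to doing nothing. The point of the example: building a bomb AUP controls shifts its attainable utilities a lot, while building H, who withholds the trigger, leaves AUP’s own attainable utilities, and hence the penalty, unchanged.

```python
def aup_penalty(q_action, q_noop):
    """Sum of absolute shifts in attainable utility across auxiliary goals,
    comparing taking the action to taking no-op."""
    return sum(abs(a - n) for a, n in zip(q_action, q_noop))

# Attainable utilities over three auxiliary goals if AUP does nothing:
q_noop = [1.0, 1.0, 1.0]

# Action 1: build a moon bomb that AUP itself controls.
# AUP's attainable utilities shift dramatically, so the penalty is large.
q_build_bomb = [9.0, 0.2, 0.1]

# Action 2: build H, who keeps the red button away from AUP.
# By construction, AUP's *own* attainable utilities don't change.
q_build_h = [1.0, 1.0, 1.0]

print(aup_penalty(q_build_bomb, q_noop))  # large penalty
print(aup_penalty(q_build_h, q_noop))     # zero penalty
```

The numbers are illustrative, not derived from any environment; the sketch only shows why a penalty defined over AUP’s own attainable utilities can be blind to power that is created but deliberately withheld from AUP.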
AUP is more accurately described as a power-conserving agent: it doesn’t want more power, or less.
This wouldn’t pass intent verification. I also dispute that building this super-powerful agent is not instrumentally convergent along the way, that it would not be subverted by (or subvert) the alien agent (either is a large impact), that it wouldn’t incur approval penalties… It’s also inefficient indirection on AUP’s part: building this agent requires extra resources, but accomplishes no more than if the agent just did the work itself.