CCC says (for non-evil goals) “if the optimal policy is catastrophic, then it’s because of power-seeking”. So its contrapositive (roughly, “if the optimal policy doesn’t seek power, then it isn’t catastrophic”) is indeed as stated.
That makes sense. One thing I like about this approach is that it isn’t immediately clear what else could go wrong; the remaining problems might just be implementation details or parameter choices. For example, corrigibility from limited power only works if we make sure the agent’s power stays low enough that we can turn it off. And if the agent will acquire power when that’s the only way to achieve its goal (rather than stopping at or before some limit), then it might still acquire power and be catastrophic*, and so on.
*Unless power seeking behavior is the cause of catastrophe, rather than having power.
Sorry for the ambiguity.
It wasn’t ambiguous; I meant to gesture at things like ‘astronomical waste’ (and waste on smaller scales), areas where we do want resources to be used. You already addressed this at the end of your post:
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
But I wanted to highlight the cases where we might want powerful aligned agents rather than AUP agents that don’t seek power.
What do you mean by “AUP map”? The AU landscape?
That is what I meant originally, though upon reflection a small distinction could be made:
Territory: AU landscape*
Map: AUP map (an AUP agent’s model of the landscape)
*Whether the AU landscape is thought of as ‘territory’ or as a ‘map’, conceptually AUP agents will navigate (and/or create) a map of it. (If the AU landscape is itself a map, then AUP agents navigate a map of a map. There may also be better ways to draw this distinction, e.g. treating the AU landscape as a style or type of map, just as there are maps of elevation and of topology.)
The idea is it only penalizes expected power gain.
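One way to read “it only penalizes expected power gain” is as a one-sided penalty: the agent pays for raising its expected ability to achieve things, but not for lowering it. Below is a minimal Python sketch under loudly stated assumptions: power is proxied by an auxiliary goal’s expected attainable value `q_aux`, measured against a do-nothing baseline `noop`. The function names, the weight `lam`, and the toy numbers are all hypothetical illustrations, not the actual AUP formulation.

```python
# Minimal sketch of a one-sided "expected power gain" penalty.
# Assumptions (hypothetical, not from the discussion above):
#   * "power" is proxied by q_aux(state, action): the expected attainable
#     value of some auxiliary goal after taking `action` in `state`;
#   * the comparison baseline is a do-nothing action `noop`;
#   * `lam` is a hypothetical penalty weight.


def power_gain_penalty(q_aux, state, action, noop):
    """Penalize only increases in expected auxiliary attainable utility."""
    gain = q_aux(state, action) - q_aux(state, noop)
    return max(0.0, gain)  # decreases in power are not penalized


def penalized_reward(reward, q_aux, state, action, noop, lam=1.0):
    """Primary reward minus the weighted power-gain penalty."""
    return reward(state, action) - lam * power_gain_penalty(q_aux, state, action, noop)


# Made-up numbers for a single state "s", purely for illustration.
Q_AUX = {"noop": 1.0, "grab_resources": 3.0, "do_task": 1.0, "break_tool": 0.5}
REWARD = {"noop": 0.0, "grab_resources": 2.0, "do_task": 1.5, "break_tool": 0.0}


def q_aux(state, action):
    return Q_AUX[action]


def reward(state, action):
    return REWARD[action]


for a in Q_AUX:
    print(a, penalized_reward(reward, q_aux, "s", a, "noop"))
```

With these toy numbers, `grab_resources` has the highest raw reward, but its expected power gain cancels it out, so `do_task` scores best; `break_tool` lowers auxiliary power and incurs no penalty at all, which is exactly the one-sidedness described.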
Gurkenglas previously commented that they didn’t think AUP solved ‘agent learns how to convince people/agents to do things’. While it’s not immediately clear how an agent would come to learn how to convince humans of anything (the superintelligent persuader), if an agent did obtain that power, its continued operation could constitute a risk. (Though further up this comment I brought up the possibility that “power seeking behavior is the cause of catastrophe, rather than having power.” This doesn’t seem likely in its entirety, but seems possible in part; that is, being powerful might not be as dangerous as being powerful and power seeking.)
if we make sure that power is low enough we can turn it off, if the agent will acquire power if that’s the only way to achieve its goal rather than stopping at/before some limit then it might still acquire power and be catastrophic*, etc.
Yeah. I have the math for this kind of tradeoff worked out—stay tuned!
Though further up this comment I brought up the possibility that “power seeking behavior is the cause of catastrophe, rather than having power.”
I think this is true, actually; if another agent already has a lot of power and it isn’t already catastrophic for us, their continued existence isn’t that big of a deal relative to the status quo. The bad stuff comes with the change in who has power.
The act of taking away our power is generally only incentivized so the agent can become better able to achieve its own goal. The question is, why is the agent trying to convince us of something / get someone else to do something catastrophic, if the agent isn’t trying to increase its own AU?