I was initially writing a comment about how AUPconceptual doesn’t seem to work in every case because there are actions that are catastrophic without increasing the agent’s power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn’t press a button that blows up the Earth...)
It does seem that AUPconceptual will make it so an agent doesn’t want to be shut off, though. If it’s shut off, its power goes way down (to zero if it won’t be turned on again). This might be fine, but it conflicts with the utility indifference approach. And it feels dangerous – it seems like we would need an assurance like “AUPconceptual will always prevent an agent from gaining enough power to resist being switched off”.
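To make that concrete (a rough formalization of my own, not necessarily how the post defines things): if $P(s)$ is some measure of the agent’s power in state $s$, the two-sided penalty looks roughly like

$$\text{Penalty}(s,a) \;=\; \lambda \,\bigl|\, \mathbb{E}[P(s') \mid s, a] \;-\; \mathbb{E}[P(s') \mid s, \text{inaction}] \,\bigr|,$$

which heavily penalizes pressing the Earth-destroying button (power collapses), but also heavily penalizes outcomes where the agent gets shut off (power drops to zero), both measured against the inaction baseline.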
I was initially writing a comment about how AUPconceptual doesn’t seem to work in every case because there are actions that are catastrophic without increasing the agent’s power (such as killing someone)
And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUPconceptual should work just fine for penalizing-increases-only.
It does seem that AUPconceptual will make it so an agent doesn’t want to be shut off, though.
I think this is much less of a problem in the “penalize increases with respect to agent inaction” scenario.
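Using the rough notation sketched above (again only as an illustration, not the post’s definition): the increases-only variant replaces the absolute value with a one-sided term,

$$\text{Penalty}(s,a) \;=\; \lambda \,\max\bigl( \mathbb{E}[P(s') \mid s, a] - \mathbb{E}[P(s') \mid s, \text{inaction}],\; 0 \bigr),$$

so losing power (e.g. by being shut off) incurs no penalty at all, and the penalty term itself no longer pushes the agent to resist shutdown.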
And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUPconceptual should work just fine for penalizing-increases-only.
The case I had in mind was “you have an AI assistant trained to keep you healthy, and the objective is operationalized in such a way that it maxes out if you’re dead (because then you can’t get sick)”. If the AI kills you, that doesn’t seem to increase its power in any way – it would probably lead to other people shutting it off, which is a decrease in power. The same goes, more generally, for any objective that can be achieved just by destroying stuff.
Yes, sure, but those aren’t catastrophes in the way I’ve defined the term here (see also Toby Ord’s The Precipice; he espouses a similar definition). They’re not existential threats, but you’re right that the agent might still do bad things.