I was initially writing a comment about how AUPconceptual doesn't seem to work in every case, because there are actions that are catastrophic without raising the agent's power (such as killing someone).
And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUPconceptual should work just fine for penalizing-increases-only.
It does seem that AUPconceptual gives the agent an incentive to avoid being shut off, though.
I think this is much less of a problem in the “penalize increases with respect to agent inaction” scenario.
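For concreteness, here is a minimal sketch of what "penalize increases with respect to agent inaction" can look like, in the spirit of the formal AUP penalty; the notation is mine, not the post's. Let $Q_{\text{aux}}(s,a)$ be the agent's attainable utility for an auxiliary goal after taking action $a$ in state $s$, let $\varnothing$ be the no-op action, and let $\lambda > 0$ be a penalty weight:

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \lambda \, \max\bigl(0,\; Q_{\text{aux}}(s,a) - Q_{\text{aux}}(s,\varnothing)\bigr)$$

Only gains in attainable utility over the inaction baseline are penalized; actions that leave the agent's power unchanged, or that decrease it (e.g. allowing shutdown), incur no penalty.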
And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUPconceptual should work just fine for penalizing-increases-only.
The case I had in mind was “you have an AI assistant trained to keep you healthy, and the objective is operationalized in such a way that it maxes out if you’re dead (because then you can’t get sick)”. If the AI kills you, that doesn’t seem to increase its power in any way – it would probably lead to other people shutting it off, which is a decrease in power. Or, more generally, any objective that can be achieved by just destroying stuff.
Yes, sure, but those aren’t catastrophes in the way I’ve defined the term here (see also Toby Ord’s The Precipice; he espouses a similar definition). They’re not existential threats, but you’re right that the agent might still do bad things.