And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUP_conceptual should work just fine for penalizing-increases-only.
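For concreteness, here is a minimal sketch of what "penalizing increases only" could look like, assuming an AUP-style penalty built from an auxiliary value function $Q_{\text{aux}}$ and a no-op action $\varnothing$ (this particular operationalization is my assumption, not something spelled out above):

$$\text{Penalty}(s, a) = \max\bigl(Q_{\text{aux}}(s, a) - Q_{\text{aux}}(s, \varnothing),\ 0\bigr)$$

The symmetric version, $\lvert Q_{\text{aux}}(s, a) - Q_{\text{aux}}(s, \varnothing)\rvert$, would also penalize decreases in attainable utility; the increases-only variant deliberately drops that second half.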
The case I had in mind was “you have an AI assistant trained to keep you healthy, and the objective is operationalized in such a way that it maxes out if you’re dead (because then you can’t get sick)”. If the AI kills you, that doesn’t seem to increase its power in any way – it would probably lead to other people shutting it off, which is a decrease in power. Or, more generally, any objective that can be achieved by just destroying stuff.
Yes, sure, but those aren’t catastrophes in the sense I’ve defined here (see also Toby Ord’s The Precipice; he espouses a similar definition). It’s not an existential threat, but you’re right that the agent might still do bad things.