I was initially writing a comment about how AUPconceptual doesn’t seem to work in every case because there are actions that are catastrophic without increasing the agent’s power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn’t press a button that blows up the Earth...)
It does seem that AUPconceptual will make it so an agent doesn’t want to be shut off, though. If it’s shut off, its power goes way down (to zero if it won’t be turned on again). This might be fine, but it conflicts with the utility indifference approach. And it feels dangerous – it seems like we would need an assurance like “AUPconceptual will always prevent an agent from gaining enough power to resist being switched off”.
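To make that concrete (a rough formalization of my own, not necessarily how the post defines things): if $P(s)$ is some measure of the agent’s power in state $s$, the two-sided penalty looks roughly like

$$\text{Penalty}(s,a) \;=\; \lambda \,\bigl|\, \mathbb{E}[P(s') \mid s, a] \;-\; \mathbb{E}[P(s') \mid s, \text{inaction}] \,\bigr|,$$

which heavily penalizes pressing the Earth-destroying button (power collapses), but also heavily penalizes outcomes where the agent gets shut off (power drops to zero), both measured against the inaction baseline.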
I was initially writing a comment about how AUPconceptual doesn’t seem to work in every case because there are actions that are catastrophic without increasing the agent’s power (such as killing someone)
And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUPconceptual should work just fine for penalizing-increases-only.
It does seem that AUPconceptual will make it so an agent doesn’t want to be shut off, though.
I think this is much less of a problem in the “penalize increases with respect to agent inaction” scenario.
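Using the rough notation sketched above (again only as an illustration, not the post’s definition): the increases-only variant replaces the absolute value with a one-sided term,

$$\text{Penalty}(s,a) \;=\; \lambda \,\max\bigl( \mathbb{E}[P(s') \mid s, a] - \mathbb{E}[P(s') \mid s, \text{inaction}],\; 0 \bigr),$$

so losing power (e.g. by being shut off) incurs no penalty at all, and the penalty term itself no longer pushes the agent to resist shutdown.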
And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to… power gain, it seems. I think that AUPconceptual should work just fine for penalizing-increases-only.
The case I had in mind was “you have an AI assistant trained to keep you healthy, and the objective is operationalized in such a way that it maxes out if you’re dead (because then you can’t get sick)”. If the AI kills you, that doesn’t seem to increase its power in any way – it would probably lead to other people shutting it off, which is a decrease in power. The same goes, more generally, for any objective that can be achieved just by destroying stuff.
Yes, sure, but those aren’t catastrophes in the way I’ve defined the term here (see also Toby Ord’s The Precipice; he espouses a similar definition). They’re not existential threats, but you’re right that the agent might still do bad things.