CCC says (for non-evil goals) “if the optimal policy is catastrophic, then it’s because of power-seeking”. So its contrapositive is indeed as stated.
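Spelled out (the predicate shorthand here is just illustration, not notation from the post):

```latex
\text{CCC (non-evil goals):}\quad \text{catastrophic}(\pi^*) \;\Rightarrow\; \text{power-seeking}(\pi^*) \\
\text{Contrapositive:}\quad \neg\,\text{power-seeking}(\pi^*) \;\Rightarrow\; \neg\,\text{catastrophic}(\pi^*)
```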
Note that preserving our attainable utilities isn't a good thing; it's just not a bad thing.
I meant “preserving” as in “not incentivized to take away power from us”, not “keeps us from benefitting from anything”, but you’re right about the implication as stated. Sorry for the ambiguity.
Is this a metaphor for making an ‘agent’ with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals?
Metaphor.
“AUP_conceptual solves this “locality” problem by regularizing the agent’s impact on the nearby AU landscape.”
Nearby from its perspective? (From a practical standpoint, if you’re close to an airport you’re close to a lot of places on earth, that you aren’t from a ‘space’ perspective.)
Nearby wrt this kind of “AU distance/practical perspective”, yes. Great catch.
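To make "regularizing impact on the nearby AU landscape" a bit more concrete, here is a minimal sketch in the style of attainable-utility penalties. The auxiliary Q-functions, the no-op action, and the weight `lam` are assumptions for illustration, not the post's actual definition:

```python
# Illustrative sketch only, not the post's definition of AUP_conceptual.
# `q_aux` is a list of Q-functions for auxiliary goals (the "AU landscape"),
# `noop` stands in for "do nothing", and `lam` scales the penalty.

def au_penalty(q_aux, state, action, noop):
    """Average shift in attainable utility across auxiliary goals,
    measured relative to inaction."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in q_aux]
    return sum(diffs) / len(q_aux)

def penalized_reward(task_reward, q_aux, state, action, noop, lam=1.0):
    """Task reward minus a scaled penalty for shifting the nearby AU landscape."""
    return task_reward(state, action) - lam * au_penalty(q_aux, state, action, noop)
```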
Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams, or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy of itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?
As a consequence of this being a more effective approach, it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive, but humans are good at picking up on this kind of low-hanging fruit.) A capable agent that thinks about process and sees 'opportunities to gain power' is of some general concern, in this case because an agent that tries to minimize reducing/affecting* other agents' attainable utility, without knowing (or needing to know) about other agents, is somewhat counterintuitive.
**It's not clear if increasing (rather than reducing) attainable utility shows up on the AUP map, or how that's handled.
Great thoughts. I think some of this will be answered in a few posts by the specific implementation details. What do you mean by “AUP map”? The AU landscape?
What does the agent do if it unexpectedly gains power by some means,
The idea is it only penalizes expected power gain.
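As a rough sketch of "only penalizes expected power gain" (the function names and Q-learning framing are assumptions for illustration): the penalty is computed from the agent's own model before acting, and only anticipated increases over inaction count.

```python
# Illustrative sketch only: penalize anticipated power gain, not realized windfalls.
# `q_agent` is the agent's own attainable-utility estimate under its current model.

def expected_power_gain_penalty(q_agent, state, action, noop):
    """Anticipated increase in the agent's attainable utility over inaction;
    decreases, and gains the model didn't predict, contribute nothing."""
    return max(q_agent(state, action) - q_agent(state, noop), 0.0)
```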
CCC says (for non-evil goals) “if the optimal policy is catastrophic, then it’s because of power-seeking”. So its contrapositive is indeed as stated.
That makes sense. One of the things I like about this approach is that it isn't immediately clear what else could be a problem, and what remains might just be implementation details or parameters: corrigibility from limited power only works if we make sure that power stays low enough that we can turn the agent off; if the agent will acquire power when that's the only way to achieve its goal, rather than stopping at or before some limit, then it might still acquire power and be catastrophic*; etc.
*Unless power seeking behavior is the cause of catastrophe, rather than having power.
Sorry for the ambiguity.
It wasn't ambiguous; I meant to gesture at stuff like 'astronomical waste' (and waste on smaller scales): areas where we do want resources to be used. This was addressed at the end of your post already:
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
-but I wanted to highlight the area where we might want powerful aligned agents, rather than AUP agents that don’t seek power.
What do you mean by “AUP map”? The AU landscape?
That is what I meant originally, though upon reflection a small distinction could be made:
Territory: AU landscape*
Map: AUP map (an AUP agent’s model of the landscape)
*Whether this is thought of as 'Territory' or as a 'Map', conceptually AUP agents will navigate (and/or create) a map of the AU landscape. (If the AU landscape is itself a map, then AUP agents may navigate a map of a map. There also might be better ways this distinction could be made; e.g. the AU landscape is a style/type of map, just as there are maps of elevation and topography.)
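A toy illustration of that distinction (names purely hypothetical): the territory is the true AU landscape, while the map is whatever estimate of it the AUP agent actually navigates by.

```python
# Hypothetical toy example of the territory/map distinction.
true_au_landscape = {            # territory: actual attainable utilities
    "make_paperclips": 0.9,
    "human_flourishing": 0.7,
}
aup_map = {                      # map: the AUP agent's model of that landscape
    "make_paperclips": 0.85,     # estimates, which may be wrong
    "human_flourishing": 0.60,
}
```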
The idea is it only penalizes expected power gain.
Gurkenglas previously commented that they didn't think AUP solved 'an agent learns how to convince people/agents to do things'. While it's not immediately clear how an agent could happen to find out how to convince humans of anything (the super-intelligent persuader), if an agent obtained that power, its continuing to operate could constitute a risk. (Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power." This doesn't seem likely in its entirety, but seems possible in part: that is, powerful but not power-seeking might not be as dangerous as powerful and power-seeking.)
if we make sure that power stays low enough that we can turn the agent off; if the agent will acquire power when that's the only way to achieve its goal, rather than stopping at or before some limit, then it might still acquire power and be catastrophic*; etc.
Yeah. I have the math for this kind of tradeoff worked out—stay tuned!
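While waiting for the actual math, here is a purely hypothetical placeholder for the kind of knob such a tradeoff might involve: with a large enough penalty weight, the value of grabbing power is dominated by the penalty, so the agent prefers to stay weak enough to be switched off.

```python
# Hypothetical sketch of the tradeoff, not the forthcoming math.
def net_value_of_power_grab(task_gain, power_gain, lam):
    """Task reward obtained by grabbing power, minus the scaled power penalty."""
    return task_gain - lam * power_gain

# With lam = 10, gaining 1 unit of task reward at the cost of 0.5 units of
# power gain nets 1 - 10 * 0.5 = -4, so inaction (net 0) looks better.
print(net_value_of_power_grab(1.0, 0.5, lam=10.0))  # -4.0
```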
Though further up this comment I brought up the possibility that “power seeking behavior is the cause of catastrophe, rather than having power.”
I think this is true, actually; if another agent already has a lot of power and it isn’t already catastrophic for us, their continued existence isn’t that big of a deal wrt the status quo. The bad stuff comes with the change in who has power.
The act of taking away our power is generally only incentivized so the agent can become better able to achieve its own goal. The question is, why is the agent trying to convince us of something / get someone else to do something catastrophic, if the agent isn’t trying to increase its own AU?