More seriously, the reason I’m sceptical of impact measures is that it feels like they all fail for the same reason. Unfortunately, I can’t articulate that reason; it’s the result of a long history of trying to build impact measures and trying to break them. I just have a feel for where the weaknesses are. So I knew that subagents would be a problem for AUP long before I could articulate it formally.
But, as I said, I unfortunately can’t formalise this feeling; it remains personal.
For this example, it was harder than usual to come up with a counter-example. And I was surprised that half of AUP survived fine—I would not have expected that a restriction against lowering your power would be unhackable. So consider these mild positives for your approach.
But my instinctive feeling remains: I hope that AUP can be made to work for superintelligences, but I expect that it won’t :-(
I expect AUP to fail on embedded-agency problems (among which I’d include the subagent problem). Do you expect it to fail in other areas?
Yes. Subagent problems are not cleanly separated from other problems (see section 3.4 of https://www.lesswrong.com/posts/mdQEraEZQLg7jtozn/subagents-and-impact-measures-full-and-fully-illustrated , where the subagent is replaced with a rock). The impact penalty encourages the agent to put restrictions on its own future possible actions. Doing this through a subagent is one way, but there are many others (see Odysseus and the sirens, or section 6.2 of the post linked above in this comment).
Thanks for the link (and the excellent write-up of the problem)!
Regarding the setting, how would the agent gain the ability to create a subagent, roll a rock, or limit its own abilities in the first place? In AUP, you normally start with a high penalty for acquiring power and then scale it down until reasonable, non-catastrophic plans become available, but your post begins with the agent already having this power.
I don’t think AUP prevents abuse of power the agent currently has (?), but rather prevents it from gaining that power in the first place.
AUP is supposed to prevent the agent from accumulating power. The AI initially has huge potential power (its potential power is all the power it could ever accumulate, given its best strategy for accumulating power), and the penalty is supposed to prevent it from turning that potential into actual power, as measured by AUP.
So the AI always has the power to build a subagent; that post just shows that it can do this without triggering the AUP-power penalty.
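To make that mechanism concrete, here is a rough sketch of the penalty as I understand the standard AUP formulation (the auxiliary rewards $R_i$ and the no-op action $\varnothing$ come from that formulation, not from anything in this thread): the agent is penalised for changing its own attainable utilities relative to doing nothing,

$$\text{Penalty}(s,a) \;=\; \sum_i \bigl| Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \bigr|.$$

Since these $Q$-values are the builder’s own, a plan that routes the actual power through a subagent can leave each $Q_{R_i}(s,a)$ close to $Q_{R_i}(s,\varnothing)$, so the measured penalty stays small even though the overall system has gained power.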