Stuart_Armstrong comments on Attainable Utility Preservation: Scaling to Superhuman

Stuart_Armstrong 27 Feb 2020 13:45 UTC
3 points

Do you expect it to fail in other areas?

Yes. Subagent problems are not cleanly separated from other problems (see section 3.4 of https://www.lesswrong.com/posts/mdQEraEZQLg7jtozn/subagents-and-impact-measures-full-and-fully-illustrated , where the subagent is replaced with a rock). The impact penalty encourages the agent to put restrictions on its own possible future actions. Doing this through a subagent is one way, but there are many others (see Odysseus and the sirens, or section 6.2 of the post linked above).
- Logan Riggs 27 Feb 2020 18:26 UTC
  1 point
  Parent
  Thanks for the link (and the excellent write-up of the problem)!
  Regarding the setting, how would the agent initially gain the ability to create a sub-agent, roll a rock, or limit its own abilities? Throughout AUP, you normally start with a high penalty for acquiring power, and then you scale it down to reach reasonable, non-catastrophic plans, but your post begins with the agent already having higher power.
  I don’t think AUP prevents abuse of power you currently have (?), but prevents gaining that power in the first place.
  - Stuart_Armstrong 28 Feb 2020 11:15 UTC
    3 points
    Parent
    The AUP is supposed to prevent the agent accumulating power. The AI initially has huge potential power (because its potential power is all the power it could ever accumulate, given its best strategy to accumulate power) and the penalty is supposed to prevent it turning that potential into actual power—as measured by AUP.
    
    So the AI always has the power to build a subagent; that post just shows that it can do this without triggering the AUP-power penalty.
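
For readers who want the penalty under discussion made concrete, here is a minimal sketch, assuming the simplified, unscaled form in which the penalty is the summed absolute change in the agent's attainable utility relative to a no-op. The function name `aup_penalty`, the auxiliary goals `u1` and `u2`, and all the numbers are hypothetical and are not taken from the thread or the linked post.

```python
from typing import Dict


def aup_penalty(q_action: Dict[str, float], q_noop: Dict[str, float]) -> float:
    """Summed absolute change in attainable utility versus doing nothing.

    q_action[name]: value the agent could attain for auxiliary goal `name`
                    after taking the action under consideration.
    q_noop[name]:   the same quantity after a no-op.
    (Simplified assumption: no scaling term, finite set of auxiliary goals.)
    """
    return sum(abs(q_action[name] - q_noop[name]) for name in q_noop)


# Toy, made-up attainable-utility values for two auxiliary goals.
q_noop = {"u1": 1.0, "u2": 2.0}

# An action that openly increases what the agent itself can attain.
q_grab_power = {"u1": 5.0, "u2": 6.0}

# An action that leaves the agent's own attainable utilities unchanged.
q_unchanged = {"u1": 1.0, "u2": 2.0}

penalty_grab = aup_penalty(q_grab_power, q_noop)
penalty_unchanged = aup_penalty(q_unchanged, q_noop)
```

In this toy setup the openly power-seeking action is penalized, while any action that leaves the agent's own measured attainable utilities unchanged gets a penalty of zero. That is the gap the thread is pointing at: the penalty only sees power as measured for the agent itself.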