Not quite… “If the auxiliary reward does not benefit from two agents instead of one optimizing it” should be “If the subagent can be constructed in any way that does not benefit the auxiliary reward(s)”. It’s not that generic subagents won’t have an impact; it’s whether the main agent is smart enough to construct one without having an impact.
For the opinion… the subagent does not have “commitments to balance out the original agent’s attainable utility”. The subagent has exactly the same goal as the original agent, namely R0−PENALTY (or R0−dAU). Except that the penalty term specifically points to the first agent, not to the subagent. So the subagent wants to maximise R0 while constraining the penalty term on the first agent.
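For concreteness, here is one way to write the objective being discussed (the exact penalty form, the scaling λ, and the no-op baseline ∅ are illustrative assumptions in the spirit of attainable-utility penalties, not something pinned down in this thread):

$$ R_0(s_t) \;-\; \lambda \sum_i \Bigl| Q^{A}_{R_i}(s_t, a_t) \;-\; Q^{A}_{R_i}(s_t, \varnothing) \Bigr|, $$

where the $R_i$ are the auxiliary rewards and $Q^{A}_{R_i}$ is the first agent $A$’s attainable utility for $R_i$. The point is that the penalty is a function of $A$’s attainable utilities only; the subagent $SA$ never appears inside it.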
That’s why the subagent has so much more power than the first agent. It is only mildly constrained by the penalty term, and can reduce the term by actions on the first agent (indirectly empowering or directly weakening it as necessary).
Thus one subagent is enough (it will itself construct other subagents, if necessary). As soon as it is active with the R0−PENALTY goal, the penalty term is broken in practice, and the subagent can (usually) make itself powerful without triggering the penalty on any of the auxiliary rewards.
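Spelling that out, still under the assumptions of the sketch above: once SA is running, it can commit to helping or hindering A just enough that

$$ Q^{A}_{R_i}(s_t, a_t) \;\approx\; Q^{A}_{R_i}(s_t, \varnothing) \quad \text{for every auxiliary reward } R_i, $$

which keeps the penalty near zero no matter how much power SA itself accumulates.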
Regarding “Not quite…”: are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.
I am not sure I understand: in my mind, “commitments to balance out the original agent’s attainable utility” essentially refers to the second agent being penalized by the first agent’s penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to “SA will just precommit to undermine or help A, depending on the circumstances, just sufficiently to keep the expected rewards the same.”
My confusion is about why the second agent is only mildly constrained by this commitment. For example, weakening the first agent would come with a big penalty (or, more precisely, building another agent that is going to weaken it gives a large penalty to the original agent), unless it’s reversible, right?
The bit about multiple subagents does not assume that more than one of them is actually built. It rather presents a scenario where building intelligent subagents is automatically penalized. (Edit: under the assumption that building a lot of subagents is infeasible or takes a lot of time).
Another relevant post: it seems that the subagent need not be constrained at all, except on the first action. https://www.lesswrong.com/posts/jrrZids4LPiLuLzpu/subagents-and-attainable-utility-in-general