Flo’s summary for the Alignment Newsletter:
This post argues that regularizing an agent’s impact by <@attainable utility@>(@Towards a New Impact Measure@) can fail when the agent is able to construct subagents. Attainable utility regularization restricts the agent’s power-seeking by using a set of auxiliary rewards and penalizing the agent for changing its ability to obtain high expected reward on them. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between taking the action and doing nothing for one time step, with the agent optimizing for the auxiliary reward afterwards in either case.
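In symbols, a rough sketch of this penalty (omitting the scaling term used in the full formulation), with auxiliary rewards $R_1, \dots, R_n$, a no-op action $\varnothing$, and $Q_{R_i}(s, a)$ denoting the expected cumulative auxiliary reward from taking action $a$ in state $s$ and then acting optimally for $R_i$:

$$\text{Penalty}(s, a) \;=\; \sum_{i=1}^{n} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|$$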
This can be circumvented in some cases: if the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can simply build a copy of itself that is not subject to the penalty, as doing so does not change the original agent’s ability to get high auxiliary reward. For more general auxiliary rewards, the agent could build another, more powerful agent, as long as that powerful agent commits to balancing out the ensuing changes in the original agent’s attainable auxiliary rewards.
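In terms of the sketch above, the circumvention works whenever the subagent SA can be built such that constructing it leaves the original agent’s attainable utility approximately unchanged for every auxiliary reward,

$$Q_{R_i}(s, a_{\text{build SA}}) \;\approx\; Q_{R_i}(s, \varnothing) \quad \text{for all } i,$$

for instance because SA commits to helping or hindering the original agent just enough to keep these values fixed.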
Flo’s opinion:
I am confused about how much the commitment to balance out the original agent’s attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent’s ability to produce subagents of varying generality with different goals: if a subagent that optimizes for a single auxiliary reward were easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.
Not quite… “If the auxiliary reward does not benefit from two agents instead of one optimizing it” should be “If the subagent can be constructed in any way that does not benefit the auxiliary reward(s)”. It’s not that generic subagents won’t have an impact; it’s whether the main agent is smart enough to construct one without having an impact.
For the opinion… the subagent does not have “commitments to balance out the original agent’s attainable utility”. The subagent has exactly the same goal as the original agent, namely R0−PENALTY (or R0−dAU). Except that the penalty term specifically points to the first agent, not to the subagent. So the subagent wants to maximise R0 while constraining the penalty term on the first agent.
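To make the asymmetry explicit in the notation of the sketch above: both the first agent A and the subagent SA optimize

$$R_0 - \text{Penalty}_A,$$

where $\text{Penalty}_A$ is computed from A’s attainable auxiliary rewards, so SA’s actions are penalized only insofar as they change what A could achieve, not what SA itself could achieve.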
That’s why the subagent has so much more power than the first agent. It is only mildly constrained by the penalty term, and can reduce the term by actions on the first agent (indirectly empowering or directly weakening it as necessary).
Thus one subagent is enough (it will itself construct other subagents, if necessary). As soon as it is active with the R0−PENALTY goal, the penalty term is broken in practice, and the subagent can (usually) make itself powerful without triggering the penalty on any of the auxiliary rewards.
“Not quite… ” are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.
I am not sure I understand: in my mind, “commitments to balance out the original agent’s attainable utility” essentially refers to the second agent being penalized by the first agent’s penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to “SA will just precommit to undermine or help A, depending on the circumstances, just sufficiently to keep the expected rewards the same.”
My confusion is about why the second agent is only mildly constrained by this commitment. For example, weakening the first agent would come with a big penalty (or more precisely, building another agent that is going to weaken it gives a large penalty to the original agent), unless it’s reversible, right?
The bit about multiple subagents does not assume that more than one of them is actually built. It rather presents a scenario where building intelligent subagents is automatically penalized. (Edit: under the assumption that building a lot of subagents is infeasible or takes a lot of time).
Another relevant post: it seems that the subagent need not be constrained at all, except on the first action. https://www.lesswrong.com/posts/jrrZids4LPiLuLzpu/subagents-and-attainable-utility-in-general
Nitpick: “Attainable utility regularization” should be “Attainable utility preservation”