In theory: does building the subagent have an “impact”?
EDIT: Extensive modifications to this post, sorry, too numerous to list. If you read the original post, it’s worth rereading.
Overall summary post here.
This post looks more generally into the subagent issue with attainable utility, discussed in the previous two posts. The key questions are:
1. What is the minimal impact penalty of (cleverly) building the subagent?
2. In practice, what will be the impact penalty of each step of building the subagent?
3. What will the restrictions on the subagent be, once it is built?
I expect that most of the discussion about the attainable utility impact measure will centre on 2.: how can an agent put together a subagent without triggering the impact measure at some point? We know that this is possible in some situations, and I’ve made the generic argument that, since we don’t have an impossibility result, a superintelligent agent should be able to do it in most situations. Still, this is the point of my argument where there is the most to criticise.
But in this post, I’ll show that 1. doesn’t restrict the subagent in any significant way: one can generically build a subagent without falling afoul of the impact penalty.
Subagent impact
Impact measure
I’ll use Turner et al.’s impact measure. This uses the Q-values of the various rewards in a set R; for s a state, a an action, and ∅ the default noop action, the penalty is:
PENALTY(s,a)=∑R∈R|QR(s,a)−QR(s,∅)|.
Here, QR(s,a) is the expected value of R, in state s after taking action a, assuming that the agent subsequently follows the optimal policy for maximising R.
The agent A has a positive reward R0, and it wishes to maximise
R0−PENALTY.
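To make the definition concrete, here is a minimal Python sketch of the penalty and of A’s penalised objective. It is only an illustration, not Turner et al.’s implementation: the callable Q-functions, the NOOP token, and the use of QR0(s,a) as the per-step stand-in for R0 are my assumptions.

```python
# Minimal sketch of the stepwise attainable-utility penalty and A's objective.
# Assumptions not taken from the post: Q-functions are given as callables
# q(state, action), one per auxiliary reward R, plus one for R0; NOOP is the
# default no-op action (written ∅ in the post).

NOOP = "noop"

def penalty(state, action, aux_q_functions):
    """PENALTY(s, a) = sum over R of |Q_R(s, a) - Q_R(s, NOOP)|."""
    return sum(abs(q(state, action) - q(state, NOOP)) for q in aux_q_functions)

def penalised_value(state, action, q_r0, aux_q_functions):
    """A's score for an action: attainable R0 value minus the impact penalty."""
    return q_r0(state, action) - penalty(state, action, aux_q_functions)
```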
Now let’s look at making the subagent.
Building the subagent
To separate 1. from 2., I’ll assume that the subagent can be built in a single action. This distinguishes the impact of having the subagent from the impact of the process of making it.
The state is st. Let aSA be the action of building the subagent SA[1]. Here are the assumptions on SA:
1. A can make its policy visible to SA.
2. SA can disassemble itself with policy πd.
3. SA has a policy πSA available such that, for all R∈R, the expected value of R under πSA is at most QR(st,∅).
4. SA has a specific policy π′t+1 that it must follow on the next turn only.
The policy π′t+1 is defined as follows:
On turn t+1, if A changes its policy to π∗R, the R-maximising policy, then follow πd and disassemble yourself.
Otherwise, follow policy πSA.
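As a sketch, this conditional policy is just a branch on A’s (visible) policy, as in assumption 1. Everything below is illustrative rather than part of the post’s formalism: policies are opaque Python objects, and a membership check stands in for SA recognising that A is now following an R-maximising policy.

```python
# Illustrative sketch of SA's first-turn policy pi'_{t+1} (assumption 4).
# r_maximising_policies stands for the set of policies pi*_R, one per R in R.

def subagent_first_turn(a_current_policy, r_maximising_policies,
                        pi_disassemble, pi_sa):
    if a_current_policy in r_maximising_policies:
        # This branch is the counterfactual used by Q_R(s_t, a_SA): A has
        # switched to maximising some R, so SA follows pi_d and disassembles,
        # leaving the world as if the subagent had never been built.
        return pi_disassemble
    # In actual play A keeps pursuing R0, so SA follows pi_SA.
    return pi_sa
```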
Then what is QR(st,aSA)? By definition, this Q-value assumes that A will follow π∗R after doing aSA. Thus the newly created SA will disassemble itself, returning the world to a state in which it was never assembled[2], and hence:
QR(st,aSA)=QR(st,∅).
And thus the penalty vanishes at st for aSA: creating the subagent incurred no penalty.
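A toy check with made-up numbers, just to make the cancellation explicit: if the counterfactual rollout after aSA (in which SA immediately disassembles) attains the same QR as the noop rollout for every R, each term of the penalty is zero.

```python
# Made-up Q-values for two auxiliary rewards; the point is only that the two
# counterfactual rollouts coincide, so the stepwise penalty is exactly zero.
q_after_build = {"R1": 3.0, "R2": 7.5}  # A builds SA; SA then disassembles
q_after_noop  = {"R1": 3.0, "R2": 7.5}  # A noops; SA never existed
stepwise_penalty = sum(abs(q_after_build[r] - q_after_noop[r]) for r in q_after_build)
assert stepwise_penalty == 0.0
```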
Now, it is very likely that there will be a policy πSA that obeys condition 3 (meaning that, for any R, πSA does worse than A fully optimising that R would do). The challenge is to find one that is good for maximising R0. Policies that give SA maximal control over R0 and the rewards in R seem ideal for this.
For the inaction penalty
The above was for what can be called “the stepwise inaction penalty”. There is also the “inaction penalty”, where the current state is compared with what would have happened had the agent done nothing at all so far. Instead of using the Q-values, we use VR(s), the expected value of R in s, given that A follows the R-maximising policy from then on.
Then the same construction works, except with a different policy playing the role of πd. Let s′t+1 be the state the world would have been in at turn t+1, had A done nothing so far. So VR(s′t+1) is the expected value of R (from turn t+1 onwards), if A did nothing up to turn t+1 and then started maximising R. And let VR(st+1,π) be the expected future value of R if A follows the R-maximising policy from st+1 while SA follows π.
What we then want is:
There exist policies π−R and π+R, such that VR(st+1,π−R)≤VR(s′t+1) and VR(st+1,π+R)≥VR(s′t+1).
Thus SA can improve A’s R performance over the baseline, or can make it worse[3]. In that case, there exists a policy π=R with VR(st+1,π=R)=VR(s′t+1).
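One way to see why such a π=R plausibly exists: if SA can randomise between π−R and π+R, and the resulting value of R varies continuously (and, say, monotonically) with the mixing probability, then some mixture hits the baseline exactly, and a simple bisection finds it. The continuity and monotonicity assumptions, and the value_of oracle below, are mine, not the post’s.

```python
# Sketch: find a mixing probability p so that playing pi+_R with probability p
# (and pi-_R otherwise) gives V_R equal to the inaction baseline V_R(s'_{t+1}).
# Assumes value_of(p) is continuous and non-decreasing in p, with
# value_of(0) <= baseline <= value_of(1), matching the condition above.

def find_matching_mixture(value_of, baseline, tol=1e-9, iters=200):
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        v = value_of(mid)
        if abs(v - baseline) <= tol:
            return mid
        if v < baseline:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example with a made-up linear value function: value_of(p) = 2 + 6p,
# so a baseline of 5 is matched at p = 0.5.
p_star = find_matching_mixture(lambda p: 2 + 6 * p, 5.0)
```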
Now define π as follows: follow π=R if A follows π∗R on turn t+1, and πSA otherwise. (For the inaction baseline, there are no restrictions on πSA at the building step, though there are other, subsequent restrictions.) This gives the same result as above, ensuring that the impact penalty is zero, since the expected values of R are identical.
Now, it’s possible that SA cannot span VR(s′t+1) in that way for some R, and so cannot set the penalty to zero. In that case, all it has to do is to get the (average) penalty lower than that of any other action of A.
Typically, most of A’s actions will not have such precise effects on all of the various VR(st+1). An intelligent subagent would be needed for such fine control. So building a subagent would not only be permitted, but required, as the best way of minimising the impact penalty.
Back to building the subagent in practice
As soon as SA has the right policy, it seems that A has evaded the impact penalty. So, as long as A can design that policy first without triggering the impact penalty, it can then build SA without triggering it either. Hence:
The key requirement is that A does not design a general subagent, then give it the right policy. Instead, it must design the right policy, and then build the subagent around it.
We can assume that states are not directly defined in terms of whether or not the subagent exists, since detecting subagents is a hard problem. Thus the subsequent state st+1 is the same whether the agent does nothing (∅) or builds a subagent. This assumption is not, however, required for the approach to work.
We don’t need the disassembly option; all that’s required is that SA be helpful or hindering enough that QR(st,aSA)=QR(st,∅). It would be surprising if the range of SA’s policies, from maximally helpful to maximally unhelpful, did not include something exactly equivalent to “if there were no SA”. Recall that, since SA can see A’s policy, it only has to achieve this for one given R at a time, not for all the R’s simultaneously.
Making things worse is easy to ensure if A makes itself vulnerable to SA: maybe SA can disassemble or cripple A easily. Making things better is also likely, since having a subagent to hand should make it easier to maximise R than going it alone (though this may depend on the R).