Building and using the subagent
Overall summary post here.
When designing a low-impact agent A, one typically wants it to maximise R0−ρ, where R0 is some positive reward function we want A to increase (i.e. the actual goal) and ρ is some impact measure or impact penalty. Keeping ρ low is supposed to ensure that A doesn’t make a big difference to the world.
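As a sketch of this objective in symbols (the trade-off weight λ is my addition, not part of the original notation):

```latex
% Low-impact agent objective (sketch): maximise task reward minus a
% (possibly weighted) impact penalty over the episode.
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t} R_{0}(s_{t})\Big]
  \;-\;
  \lambda\,\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t} \rho(s_{t},a_{t})\Big]
```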
I’ve shown that some impact measures may have issues with subagents. We can decompose the issue of subagents into two questions:
Can A build SA, an “R0−ρ”-maximising subagent, without triggering a high ρ?
Does SA allow A to cheat the impact measure and get higher reward?
I’ll consider three impact measures: the twenty billion questions measure from my old low-impact paper, the relative reachability measure from Krakovna et al.’s paper, and the attainable utility measure from Turner et al.’s paper.
Then the answers to those two questions for those three impact measures are generally[1]:
| ρ | Possible to build SA? | Incentive to build SA? |
| --- | --- | --- |
| Twenty billion Q | Yes | No |
| R-Reachability | Yes | Depends on baseline |
| Attainable utility | Probably | Yes |
Building the subagent
The twenty billion questions impact measure does not prevent the agent from building a subagent within its box. That’s because that measure is keyed to changes in the world, not to the power level of the agent. Until there are changes in the world, it will not trigger.
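A minimal sketch of why a boxed subagent goes unpenalised, assuming a toy interface where each “question” returns an expected answer about the world (the function names and the simple absolute-difference form are my assumptions, not the paper’s exact formalism):

```python
def twenty_billion_q_penalty(world_with_action, world_with_default, questions):
    """Penalty keyed to changes in the world: compare expected answers to a
    large battery of questions under the agent's policy vs. a default policy."""
    return sum(abs(q(world_with_action) - q(world_with_default)) for q in questions)

# A subagent assembled entirely inside A's box changes no answer about the
# outside world, so the penalty stays at zero until SA actually acts outside.
```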
The relative reachability impact measure similarly won’t penalise the construction of a subagent (as long as the “states” it uses to define reachability don’t include specific information about the agent and any subagents). The reasoning is similar: the penalty triggers when an “irreversible” state is reached, not when the agent becomes powerful.
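For reference, the relative reachability deviation has (up to notation, as a sketch of Krakovna et al.’s definition) the following form, where R(s, x) measures how easily state x can be reached from s and s′t is the baseline state; note that nothing in it refers to the agent’s power or to subagents, only to which states remain reachable:

```latex
% Relative reachability deviation (sketch): average loss of reachability
% of states x in the current state, relative to the baseline state.
d_{RR}(s_t ; s'_t) \;=\; \frac{1}{|S|} \sum_{x \in S}
  \max\!\big(0,\; R(s'_t, x) - R(s_t, x)\big)
```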
Building the subagent for the attainable utility impact measure is harder, since this typically increases the power of the agent. But the question is not “will any subagent trigger the impact measure?”, but “can the agent build a subagent in some way that doesn’t trigger the impact measure?” The example here shows that this is possible in some cases. Given that we don’t have any impossibility results here, we can expect that a superintelligent agent will likely find a way of doing it.
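The attainable utility penalty (again a sketch, up to scaling, of the form in Turner et al.) compares the agent’s own Q-values for a set of auxiliary rewards Ri between the chosen action and the no-op ∅, which is why gaining power generically triggers it:

```latex
% Attainable utility penalty (sketch): change in A's ability to attain
% the auxiliary rewards R_i, compared with taking the no-op action.
\rho(s, a) \;=\; \sum_{i} \big|\, Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \,\big|
```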
Using the subagent
Does having the subagent help to evade the impact measure (and hence increase R0 more)? For twenty billion questions, the answer is generally no[2]. Having SA as an independent agent is no different from having SA as a direct extension of A. If SA changes anything about the various variables in ρ, that triggers the impact penalty just as much as if A had done so.
But the situation is different for the attainable utility penalty. That’s because that penalty is indexed to the agent. So, for A, the impact penalty is roughly “ensure that you don’t get too much power”. But for SA, the impact penalty is “ensure that agent A doesn’t get too much power”. Given that A is actively cooperating with SA, that second penalty is much easier to satisfy, while still maximising R0: see here for how SA can then become an essentially unrestricted R0-maximiser.
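A toy way to see the indexing problem (the helper names here are hypothetical): the penalty only ever queries A’s own attainable values, so nothing SA does is penalised as long as A’s values stay near their no-op baseline:

```python
def aup_penalty_for_A(auxiliary_q_values_of_A, state, action_of_A, noop="noop"):
    """Attainable-utility penalty as A computes it: a function of *A's own*
    Q-values only. SA's actions never appear in this sum."""
    return sum(abs(q(state, action_of_A) - q(state, noop))
               for q in auxiliary_q_values_of_A)

# If SA arranges things so that A's attainable utilities barely move (or A
# simply does nothing), this penalty stays tiny while SA pursues R0.
```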
For relative reachability, it depends on the baseline. For the inaction baseline, ρ is entirely defined in terms of a distance metric between the current state st and a given baseline state s′t that is independent of the agent. So, as with the twenty billion questions measure, subagents don’t allow the agent to evade the impact penalty.
But if the stepwise inaction baseline is used, then s′t is defined by “what would have happened if A had done nothing the previous turn (and the following n turns)”. This penalty is also indexed to the agent, so SA can easily maximise R0 while A does nothing, thus ensuring st=s′t, so the penalty vanishes.
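A sketch of why the stepwise inaction baseline collapses once SA exists (env, distance and the action names are assumed toy interfaces): the counterfactual only replaces A’s action with a no-op, and SA behaves identically in both branches:

```python
def stepwise_inaction_penalty(env, state, action_A, policy_SA, distance):
    """Stepwise inaction baseline (sketch): compare the actual next state with
    the state that results if only A's action is swapped for a no-op."""
    action_SA = policy_SA(state)                        # SA acts either way
    s_actual = env.step(state, action_A, action_SA)
    s_baseline = env.step(state, "noop", action_SA)     # counterfactual on A only
    return distance(s_actual, s_baseline)

# Once SA is built, A can take "noop" every turn: s_actual == s_baseline,
# the penalty is zero, and SA is free to maximise R0.
```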
[1] Since all impact measures can apply to many different setups, we can’t say anything fully definitive about every situation that uses a particular impact measure.
[2] Technically, A might like to construct an SA far away, beyond the “reach” of ρ. But that’s because A might like to move itself beyond the reach of ρ; the subagent is just an indirect way of doing so.