RR attempted to control the side-effects of an agent by ensuring it had enough power to reach a lot of states; this effect is not neutralised by a subagent.
Things might get complicated by partial observability; in the real world, the agent is minimizing change in its beliefs about what it can reach. Otherwise, you could just get around the SA problem for AUP as well by replacing its auxiliary reward functions with state-indicator reward functions.
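To spell out that substitution (a rough sketch in standard MDP notation, not notation from this discussion): if an auxiliary reward is an indicator on a single state, paying off once when that state is reached, then the attainable utility for that reward is just the discounted reachability of that state, which is the quantity RR is built on:

$$ R_x(s) = \mathbb{1}[s = x] \quad\Longrightarrow\quad V^*_{R_x}(s) \approx \gamma^{\,d(s,\,x)}, $$

where $d(s,x)$ is the fewest steps needed to reach $x$ from $s$ (exactly so in a deterministic, fully observable environment; only approximately once the agent is working with beliefs under partial observability).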
AU and RR have the same SA problem, formally, in terms of excess power; it’s just that AU wants low power and RR wants high power, so they don’t have the same problem in practice.
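For reference, the two penalty shapes, roughly as formulated in the RR and AUP papers (so the notation is imported rather than defined here):

$$ d_{RR}(s_t;\, s'_t) = \frac{1}{|S|}\sum_{x} \max\!\big(R(s'_t; x) - R(s_t; x),\, 0\big), \qquad \text{Penalty}_{AUP}(s,a) = \sum_{i} \big|\, Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \,\big|, $$

with $s'_t$ the inaction baseline state and $\varnothing$ the no-op action. Both measure deviation from a baseline level of attainable utility/reachability; RR only penalises losses of reachability relative to the baseline (pushing the agent to keep its power high), while AUP penalises changes in either direction, including increases in attainable utility (the "low power" side).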