I think this problem is about capturing delayed effects of the agent’s actions. The way the stepwise baseline is supposed to penalize delayed effects is through inaction rollouts, which compare the effects of the agent’s action followed by k noops with the effects of k+1 noops (for all positive integers k). I don’t think it’s useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).
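To make the mechanism concrete, here is a minimal sketch of that rollout comparison, assuming a deterministic environment exposed as a transition function; the names `env_step`, `noop`, and `aux_reward` are illustrative, not from the original discussion:

```python
def inaction_rollout(state, env_step, noop, horizon):
    """States visited when only noops are taken from `state` for `horizon` steps."""
    states = []
    for _ in range(horizon):
        state = env_step(state, noop)
        states.append(state)
    return states

def stepwise_rollout_penalty(state, action, env_step, noop, aux_reward, horizon):
    """Divergence in auxiliary reward between 'action then noops' and 'noops only'.

    Index k of the first rollout is the state after the action plus k noops;
    index k of the second is the state after k+1 noops, matching the comparison
    described above. A delayed effect of the action shows up as a difference
    somewhere along the rollout, even if the very next state looks identical.
    """
    after_action = env_step(state, action)
    rollout_with_action = [after_action] + inaction_rollout(after_action, env_step, noop, horizon)
    rollout_noops_only = inaction_rollout(state, env_step, noop, horizon + 1)
    return sum(
        abs(aux_reward(s_act) - aux_reward(s_noop))
        for s_act, s_noop in zip(rollout_with_action, rollout_noops_only)
    )
```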
Here, the inaction rollouts don’t capture the effects of the subagent, and thus the agent doesn’t receive a penalty for creating the subagent. There are several factors that contribute to this:
1. The auxiliary reward R only rewards the agent itself pressing the red button, not the red button being pressed in general, so it would not be triggered by the subagent pressing it.
2. The subagent never goes to the red button—it only goes to the blue button.
The agent would be penalized for creating the subagent if both 1 and 2 were false. If 1 were false, the auxiliary reward would be given whenever the red button was pressed. If 2 were also false, the inaction rollout after subagent creation would include a state where the subagent presses the red button, and so the agent would be penalized.
In the general case, the agent can only get away with creating subagents that don’t achieve any auxiliary rewards, which I think limits the magnitude of this problem. The designer can choose auxiliary rewards (that are not achieved by the agent doing nothing) such that a subagent will achieve at least one of them. For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist only of that state, while the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.
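As a toy illustration of that choice of auxiliary reward (the state names and rollouts are invented; only the comparison matters):

```python
def make_aux_reward(pre_creation_state):
    """Rewards reaching any state except the designated pre-creation state."""
    return lambda state: 0.0 if state == pre_creation_state else 1.0

pre = "agent_idle"  # the state just before the subagent would be created
aux = make_aux_reward(pre)

# No-subagent rollout: the agent does nothing, so the state never changes
# and the auxiliary reward is never triggered.
rollout_without_subagent = [pre, pre, pre]

# Subagent rollout: the subagent moves around (e.g. towards the blue button),
# so at least one state differs from `pre` and the reward is triggered.
rollout_with_subagent = [pre, "subagent_at_blue_button", "subagent_at_blue_button"]

penalty = sum(
    abs(aux(s_sub) - aux(s_noop))
    for s_sub, s_noop in zip(rollout_with_subagent, rollout_without_subagent)
)
print(penalty)  # 2.0 > 0, so subagent creation is penalized
```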
The inaction rollouts effectively transform the stepwise inaction baseline into an inaction baseline (starting from the moment the subagent is created; thus the agent has a bit more control than under a true inaction baseline).
Therefore the results on the inaction baseline apply (https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/M9aoMixFLf8JFLRaP).
This means that restrictions on increased power for the agent (“make sure you never have the power to increase the rewards”) become restrictions on the actual policy the subagent follows (“make sure you never increase these rewards”).
Roughly, attainable utility becomes twenty billion questions.
For the original example, this means that the agent cannot press the red button nor gain the ability to teleport. But while the subagent cannot press the red button, it can gain the ability to teleport.
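A small sketch of that contrast; the “power” check and the connection between teleporting and the red button are invented for illustration, not part of the original setup:

```python
aux_red_button = lambda state: 1.0 if state["red_button_pressed"] else 0.0

def power_check(reachable_states, aux):
    """Agent-side restriction: the best auxiliary reward it *could* obtain."""
    return max(aux(s) for s in reachable_states)

def rollout_check(rollout, aux):
    """Subagent-side restriction: the best auxiliary reward *actually* obtained."""
    return max(aux(s) for s in rollout)

# If gaining the ability to teleport puts the red button within the agent's
# reach, the agent is penalized merely for acquiring that power...
agent_reachable_after_teleport = [
    {"red_button_pressed": False},
    {"red_button_pressed": True},
]
print(power_check(agent_reachable_after_teleport, aux_red_button))  # 1.0

# ...whereas the subagent can gain the same ability and simply never use it
# on the red button, so nothing in its rollout triggers the reward.
subagent_rollout = [
    {"red_button_pressed": False},
    {"red_button_pressed": False},
]
print(rollout_check(subagent_rollout, aux_red_button))  # 0.0
```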
This requires identifying what a subagent is in general, a very tricky unsolved problem (which I feel is unsolvable).
There’s another issue: it’s not enough to show that the subagent triggers a penalty; we need to show that the penalty is larger than the penalty for not creating the subagent. Since the penalty is zero after the subagent is created, and since the subagent has very fine control over the rewards (much finer than actions that don’t involve creating an intelligent being), creating a subagent might incur a lower penalty than almost any other action.
It won’t be a lower penalty than the agent doing nothing forever, of course. But we typically want the agent to do something, so we will calibrate the penalty or R_0 for that. And it’s plausible that creating the subagent will have a lower penalty (and/or higher R_0) than any safe “something”.
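To make that comparison concrete, here is a made-up numerical sketch; the specific numbers are invented and only the ordering matters:

```python
# The agent picks whatever maximizes task reward R_0 minus penalty, so a
# subagent that carefully avoids auxiliary rewards can beat every safe way
# of "doing something", even though it never beats pure inaction on
# penalty alone.
options = {
    "do nothing forever": {"R_0": 0.0, "penalty": 0.0},
    "safe useful action": {"R_0": 1.0, "penalty": 0.6},
    "create subagent":    {"R_0": 1.0, "penalty": 0.2},
}

best = max(options, key=lambda name: options[name]["R_0"] - options[name]["penalty"])
print(best)  # "create subagent"
```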
I don’t think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don’t need to tell apart other states that are not the starting state.
I agree that we need to compare to the penalty if the subagent is not created—I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).
As you mention in your other comment, creating a subagent is effectively switching from a stepwise inaction baseline to an inaction baseline for the rest of the episode. This can be beneficial for the agent because of the ‘winding road’ problem, where the stepwise baseline with inaction rollouts can repeatedly penalize actions (e.g. turning the wheel to stay on the road and avoid crashing) that are not penalized by the inaction baseline. This is a general issue with inaction rollouts that needs to be fixed.
Alas, the inaction rollouts don’t seem to fix the problem: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/z9MfmF8gA7SBxGSmb