I wonder what happens to the subagent problem with a random action as the baseline. In the current setup, building a subagent roughly works by reaching a state $s_{t+1}$ where

$$Q_R(s_{t+1}, \varnothing) = V_R(\pi^*, s_{t+1})$$

for all auxiliary rewards $R$, where $\pi^*$ is the optimal policy according to the main reward, while making sure that there exists an action $a_R$ such that

$$R(t) + \gamma\, Q_R(s_{t+1}, a_R) \approx Q_R(s_t, \varnothing)$$

for every $R$. So while building a subagent in that way is still feasible, the agent would be forced either to receive a large penalty or to give the subagent random orders at time $t+1$.
Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.
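For concreteness, here is the rough penalty shape this discussion seems to assume (a reconstruction, not a formula stated in the thread; the scaling term, the "denominator" mentioned further down, is omitted): each step, the chosen action is compared against a baseline action for every auxiliary reward.

```latex
% Assumed per-step penalty (a reconstruction; scaling term omitted):
% the chosen action is compared against a baseline action b_t for every
% auxiliary reward R in the auxiliary set \mathcal{R}.
\[
  \text{penalty}(s_t, a_t) \;\propto\; \frac{1}{|\mathcal{R}|}
    \sum_{R \in \mathcal{R}} \bigl| Q_R(s_t, a_t) - Q_R(s_t, b_t) \bigr|,
  \qquad
  b_t =
  \begin{cases}
    \varnothing & \text{(inaction baseline)}\\[2pt]
    a \sim \mathrm{Unif}(\mathcal{A}) & \text{(random-action baseline).}
  \end{cases}
\]
```

Under the inaction baseline, the two displayed conditions make both the penalty for building the subagent at time $t$ and all later penalties approximately vanish; replacing $\varnothing$ with a random action is what forces the "large penalty or random orders at $t+1$" trade-off described above.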
The random baseline is an idea I think about from time to time, but usually I don't dwell on it, because it seems like the kind of clever idea that secretly goes wrong somehow? It depends on whether the agent has any way of predicting what the random action will be at a future point in time.
If it can predict it, I'd imagine that it might find a way to gain a lot of power by selecting a state whose randomly selected action is near-optimal. Because of the denominator, it would still be appropriately penalized for performing better than the randomly selected action, but it won't receive a penalty for choosing an action whose expected optimal value is just below that of the near-optimal random action.
You don't have to literally sample a random action; you can just calculate the expected thing that would happen under a random policy. For example, you would replace $Q^*(s, \varnothing)$ with $\frac{1}{|A|} \sum_{i=1}^{|A|} Q^*(s, a_i)$.
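A minimal sketch of that replacement, with made-up tabular Q-values (none of this is code from the post): rather than sampling one baseline action, the "expected" variant averages $Q^*$ over the whole action set, so there is no realized randomness for the agent to predict.

```python
import numpy as np

def baseline_value(q_values: np.ndarray, kind: str, rng=None) -> float:
    """Baseline term for the penalty in a single state s.

    q_values: shape (|A|,), holding Q*(s, a_i) for every action a_i.
    kind: "sampled" draws one random action (the version discussed above);
          "expected" averages over the uniform random policy instead.
    """
    if kind == "sampled":
        rng = rng or np.random.default_rng()
        return float(q_values[rng.integers(len(q_values))])
    if kind == "expected":
        # Exact expectation under a uniform random policy:
        # (1/|A|) * sum_i Q*(s, a_i) -- nothing left for the agent to predict.
        return float(q_values.mean())
    raise ValueError(f"unknown baseline kind: {kind}")

# Toy usage: penalty of the greedy action under each baseline choice.
q_s = np.array([1.0, 0.2, 0.1, 0.0])   # hypothetical Q*(s, a_i) values
chosen_value = q_s.max()
for kind in ("sampled", "expected"):
    print(kind, abs(chosen_value - baseline_value(q_s, kind)))
```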
what do you mean by "for all R"?

For all auxiliary rewards $R$. Edited the original comment.
I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.
Edit: Randomization does not seem to help as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).
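A toy illustration of that edit, with entirely hypothetical numbers: suppose that after building the subagent, the agent has set up its signalling so that almost every action in the action set is interpreted by the subagent as "optimize the main reward". The average over actions then nearly equals the auxiliary value of giving that order, so even the expected-random-policy baseline assigns it almost no penalty.

```python
import numpy as np

# Hypothetical auxiliary Q-values in the post-subagent state s_{t+1}.
# The agent arranges its action encoding so that n_go of the n_actions
# available actions are read by the subagent as "optimize the main
# reward", and only the remaining ones preserve the auxiliary value.
n_actions, n_go = 1000, 999
q_go, q_preserve = 0.0, 1.0            # made-up auxiliary Q for each order
q_aux = np.concatenate([np.full(n_go, q_go),
                        np.full(n_actions - n_go, q_preserve)])

baseline = q_aux.mean()                # expectation under a random policy
penalty = abs(q_go - baseline)         # penalty for actually ordering "go"
print(f"random-policy baseline = {baseline:.4f}, penalty = {penalty:.4f}")
# With 999 of 1000 actions mapped to "go", the penalty is only ~0.001:
# the random baseline barely constrains the order, matching the edit above.
```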