I wonder what happens to the subagent problem with a random action as the baseline. In the current setup, building a subagent roughly works by reaching a state $s_{t+1}$ where

$$Q_R(s_{t+1}, \varnothing) = V_R(\pi^*, s_{t+1})$$

for all auxiliary rewards $R$, where $\pi^*$ is the optimal policy according to the main reward, while making sure that there exists an action $a_R$ such that

$$R(t) + \gamma\, Q_R(s_{t+1}, a_R) \approx Q_R(s_t, \varnothing)$$

for every $R$. So while building a subagent in that way is still feasible, the agent would be forced either to receive a large penalty or to give the subagent random orders at time $t+1$.
Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.
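For concreteness, here is the rough penalty shape this discussion seems to assume (a reconstruction, not a formula stated in the thread; the scaling term, the "denominator" mentioned further down, is omitted): each step, the chosen action is compared against a baseline action for every auxiliary reward.

```latex
% Assumed per-step penalty (a reconstruction; scaling term omitted):
% the chosen action is compared against a baseline action b_t for every
% auxiliary reward R in the auxiliary set \mathcal{R}.
\[
  \text{penalty}(s_t, a_t) \;\propto\; \frac{1}{|\mathcal{R}|}
    \sum_{R \in \mathcal{R}} \bigl| Q_R(s_t, a_t) - Q_R(s_t, b_t) \bigr|,
  \qquad
  b_t =
  \begin{cases}
    \varnothing & \text{(inaction baseline)}\\[2pt]
    a \sim \mathrm{Unif}(\mathcal{A}) & \text{(random-action baseline).}
  \end{cases}
\]
```

Under the inaction baseline, the two displayed conditions make both the penalty for building the subagent at time $t$ and all later penalties approximately vanish; replacing $\varnothing$ with a random action is what forces the "large penalty or random orders at $t+1$" trade-off described above.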
The random baseline is an idea I think about from time to time, but usually I don't dwell on it, because it seems like the kind of clever idea that secretly goes wrong somehow? It depends on whether the agent has any way of predicting what the random action will be at a future point in time.
If it can predict it, I'd imagine that it might find a way to gain a lot of power by selecting a state whose randomly selected action is near-optimal. Because of the denominator, it would still be appropriately penalized for performing better than the randomly selected action, but it won't receive a penalty for choosing an action whose expected optimal value is just below that of the near-optimal random action.
You don't have to literally sample a random action; you can just calculate the expected thing that would happen under a random policy. For example, you would replace $Q^*(s, \varnothing)$ with $\frac{1}{|A|} \sum_{i=1}^{|A|} Q^*(s, a_i)$.
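A minimal sketch of that replacement, with made-up tabular Q-values (none of this is code from the post): rather than sampling one baseline action, the "expected" variant averages $Q^*$ over the whole action set, so there is no realized randomness for the agent to predict.

```python
import numpy as np

def baseline_value(q_values: np.ndarray, kind: str, rng=None) -> float:
    """Baseline term for the penalty in a single state s.

    q_values: shape (|A|,), holding Q*(s, a_i) for every action a_i.
    kind: "sampled" draws one random action (the version discussed above);
          "expected" averages over the uniform random policy instead.
    """
    if kind == "sampled":
        rng = rng or np.random.default_rng()
        return float(q_values[rng.integers(len(q_values))])
    if kind == "expected":
        # Exact expectation under a uniform random policy:
        # (1/|A|) * sum_i Q*(s, a_i) -- nothing left for the agent to predict.
        return float(q_values.mean())
    raise ValueError(f"unknown baseline kind: {kind}")

# Toy usage: penalty of the greedy action under each baseline choice.
q_s = np.array([1.0, 0.2, 0.1, 0.0])   # hypothetical Q*(s, a_i) values
chosen_value = q_s.max()
for kind in ("sampled", "expected"):
    print(kind, abs(chosen_value - baseline_value(q_s, kind)))
```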
what do you mean by "for all R"?

For all auxiliary rewards $R$. Edited the original comment.
I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.
Edit: Randomization does not seem to help as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).
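A toy illustration of that edit, with entirely hypothetical numbers: suppose that after building the subagent, the agent has set up its signalling so that almost every action in the action set is interpreted by the subagent as "optimize the main reward". The average over actions then nearly equals the auxiliary value of giving that order, so even the expected-random-policy baseline assigns it almost no penalty.

```python
import numpy as np

# Hypothetical auxiliary Q-values in the post-subagent state s_{t+1}.
# The agent arranges its action encoding so that n_go of the n_actions
# available actions are read by the subagent as "optimize the main
# reward", and only the remaining ones preserve the auxiliary value.
n_actions, n_go = 1000, 999
q_go, q_preserve = 0.0, 1.0            # made-up auxiliary Q for each order
q_aux = np.concatenate([np.full(n_go, q_go),
                        np.full(n_actions - n_go, q_preserve)])

baseline = q_aux.mean()                # expectation under a random policy
penalty = abs(q_go - baseline)         # penalty for actually ordering "go"
print(f"random-policy baseline = {baseline:.4f}, penalty = {penalty:.4f}")
# With 999 of 1000 actions mapped to "go", the penalty is only ~0.001:
# the random baseline barely constrains the order, matching the edit above.
```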