Rohin Shah comments on Attainable Utility Preservation: Scaling to Superhuman

Rohin Shah 15 Mar 2020 1:06 UTC
LW: 4 AF: 3
It depends on whether the agent has any way of predicting what the random action will be at a future point in time.
You don’t have to literally sample a random action; you can just calculate the expected thing that would happen under a random policy. For example, you would replace $Q^{*}(s, ϕ)$ with $\frac{1}{|A|} \sum_{i=1}^{|A|} Q^{*}(s, a_{i})$.
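
To make the averaging concrete, here is a minimal sketch. It assumes a small discrete action set and that the optimal Q-values for the current state are already available as a dictionary. The action names, the numbers, and the helper names `uniform_policy_value` and `penalty` are all made up for illustration, and the absolute-difference penalty is just one plausible way the baseline might be plugged in.

```python
from typing import Hashable, Mapping


def uniform_policy_value(q_star: Mapping[Hashable, float]) -> float:
    """Return (1/|A|) * sum_i Q*(s, a_i) for one state s.

    `q_star` maps each available action to its optimal Q-value in s.
    No action is sampled; this is the expectation under a uniform choice.
    """
    if not q_star:
        raise ValueError("need at least one action")
    return sum(q_star.values()) / len(q_star)


def penalty(q_star: Mapping[Hashable, float], action: Hashable) -> float:
    """Hypothetical penalty: distance of Q*(s, action) from the averaged baseline.

    Assumption: the averaged value simply takes the place of the single
    baseline Q-value inside an absolute difference.
    """
    return abs(q_star[action] - uniform_policy_value(q_star))


# Made-up Q-values for one state, purely illustrative.
q_s = {"noop": 1.0, "left": 0.0, "right": 2.0, "grab": 5.0}

baseline = uniform_policy_value(q_s)
grab_penalty = penalty(q_s, "grab")
```

Since the baseline is an expectation over all actions rather than a sampled one, it is a deterministic function of the state.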