To see this, imagine the AUP agent builds a subagent to make $Q^*_R(s,a) \approx Q^*_R(s,\varnothing)$ for all future $s,a$, in order to neutralize the penalty term. This means it can’t make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.
I believe this is incorrect. The $a$ and $\varnothing$ here are actions of the AUP agent itself. The subagent just needs to cripple the AUP agent so that all of its actions are equivalent, and can then go about maximising $R$ to the utmost, as sketched below.
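To spell the objection out, assuming the penalty takes the usual AUP form of a (scaled) difference in attainable utility between acting and doing nothing:

$$\text{Penalty}(s,a) \;\propto\; \bigl|\,Q^*_R(s,a) - Q^*_R(s,\varnothing)\,\bigr|.$$

Once the subagent has crippled the AUP agent so that every action $a$ still available to it leads to the same future as $\varnothing$, we have $Q^*_R(s,a) = Q^*_R(s,\varnothing)$ exactly, so the penalty is zero at every remaining $(s,a)$. Nothing in this expression depends on the subagent's own actions, which therefore remain unconstrained.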