To see this, imagine the AUP agent builds a subagent to make $Q^*_R(s,a) \approx Q^*_R(s,\varnothing)$ for all future $s,a$, in order to neutralize the penalty term. This means it can’t make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.
I believe this is incorrect. The $a$ and $\varnothing$ here are actions of the AUP agent itself. The subagent just needs to cripple the AUP agent so that all of its actions are equivalent, and can then go about maximising $R$ to the utmost, as sketched below.
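To spell the objection out, assuming the penalty takes the usual AUP form of a (scaled) difference in attainable utility between acting and doing nothing:

$$\text{Penalty}(s,a) \;\propto\; \bigl|\,Q^*_R(s,a) - Q^*_R(s,\varnothing)\,\bigr|.$$

Once the subagent has crippled the AUP agent so that every action $a$ still available to it leads to the same future as $\varnothing$, we have $Q^*_R(s,a) = Q^*_R(s,\varnothing)$ exactly, so the penalty is zero at every remaining $(s,a)$. Nothing in this expression depends on the subagent's own actions, which therefore remain unconstrained.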