Could you change the definition of R to R_N + (if button_just_pressed then V(R_S, x) − V(R_N, x) else 0), and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure V(R_S, x) is large and V(R_N, x) is small at the moment of the button press.
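For concreteness, here is a minimal sketch of the reward definition being proposed. The names R_N, R_S, V, and button_just_pressed come from the question above; the signature V(r, x), returning the value of following reward function r from state x, is my assumption about the notation, not something fixed by this thread.

```python
def R(x, button_just_pressed, R_N, R_S, V):
    """Sketch of the proposed reward: the normal reward R_N(x),
    plus a one-off correction V(R_S, x) - V(R_N, x) applied only
    at the step where the button is pressed.

    Assumed signature: V(r, x) returns the value of following
    reward function r from state x."""
    correction = (V(R_S, x) - V(R_N, x)) if button_just_pressed else 0
    return R_N(x) + correction
```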
In general, if you forcefully change the agent's reward function into some R′, it will self-preserve from that moment on and try to maintain this R′, so it won't self-edit R′ back into the original form (the toy sketch at the end of this reply illustrates the mechanism).
There are exceptions to this general rule, for special versions of R′ and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but at first glance your example above does not seem to be one of them.
If you remove the dntu bits from the agent definition, you can get an agent that self-edits a lot without changing its fundamental goals. But the proofs of 'without changing its fundamental goals' would get even longer and less readable than the current proofs in the paper, which is why I did the dntu privileging.
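To illustrate the general rule above, here is a toy construction of my own, not the paper's formalism: an agent that scores candidate self-edits by the outcome's value under its *current* reward R′ will always rank "keep R′" above "revert to R", because a successor optimizing R reaches states that score lower under R′.

```python
# Toy illustration (my construction, not the paper's formalism):
# an agent whose successor optimizes whatever reward it carries
# after a self-edit, and which scores candidate edits under its
# CURRENT reward R_prime. It therefore keeps R_prime.

STATES = ["a", "b"]  # the two states the agent can steer toward

def R_original(x):   # the original reward: prefers state "a"
    return 1.0 if x == "a" else 0.0

def R_prime(x):      # the forcibly installed reward: prefers "b"
    return 1.0 if x == "b" else 0.0

def best_state(reward):
    """The state a reward-maximizing successor agent would reach."""
    return max(STATES, key=reward)

def value_of_edit(new_reward, current_reward):
    """Value, under the CURRENT reward, of self-editing to carry
    new_reward: the successor optimizes new_reward, and the
    resulting state is scored by current_reward."""
    return current_reward(best_state(new_reward))

keep   = value_of_edit(R_prime, current_reward=R_prime)     # 1.0
revert = value_of_edit(R_original, current_reward=R_prime)  # 0.0
assert keep > revert  # the agent self-preserves: it keeps R_prime
```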