Koen.Holtman comments on Shallow review of live agendas in alignment & safety

Koen.Holtman 29 Nov 2023 11:35 UTC
LW: 3 AF: 2
Thanks for reading my paper! For the record, I agree with some but not all points in your summary.

My later paper ‘AGI Agent Safety by Iteratively Improving the Utility Function’ also uses the simulation environment with the $>$ and $<$ actions, and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way, the $>$ and $<$ are supposed to be symbols representing the arrows $\to$ and $\leftarrow$, for ‘push # to later in time’ and ‘pull # earlier in time’.

The $g_{c}$ agent does indeed satisfy desiderata 4; there’s an incentive to preserve the shutdown mechanism; in fact, there’s again an incentive to press the shutdown mechanism!

No, the design of the $g_{c}$ agent is not motivated by the need to create an incentive to preserve the shutdown button itself, as required by desideratum 4 from Soares et al. Instead, it is motivated by the desire to create an incentive to preserve the agent’s actuators, which it will need to perform any physical actions incentivised by the shutdown reward function $R_{S}$. I introduce this as a new desideratum 6.

A discussion about shaping incentives or non-incentives to preserve the button (as a sensor) is in section 7.3, where I basically propose to enhance the indifference effects produced by the reward function by setting up the physical environment around the button in a certain way:

the physical implementation of the agent and the button can be constructed in such a way that substantial physical resources would be needed by the agent to perform any action that will press or disable the button.

For the record, adding $g_{c}$ to the agent design creates no incentive to press the shutdown button: if it did, this would be visible as $>$ actions in the simulation of the third line of figure 10, and also the proof in section 9 would not have been possible.