Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can’t find a summary anywhere that straightforwardly gives me a formal description of the setting, explains why it satisfies the desiderata it does, and says what this means for the broader problem of reflective stability in shutdownable agents.
I spent a good hour or two reading the construction and proposed solution of the paper; here’s my attempted explanation with cleaned up notation.
Basically, he considers a setting with four actions a, b, c, d, and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s ≤ 0.
The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise,
for some constant L >= 0.
The paper imagines s as measuring ‘time until the human presses the shutdown button’, and imagines action d as “stay shut down”. Because the reward of each of a, b, and c is greater than that of d before the buttonpress, the agent will never take action d before the buttonpress. After the buttonpress (i.e. the timestep at which s drops to 0 or below), the utility of “stay shut down” is higher than that of the other actions, and so the agent is incentivized to take d. That is, the agent has an incentive to shut down after the buttonpress (desideratum 1 of the 2015 Soares et al paper), by construction.
Accordingly, b is the delay action (which, in the limit of large L, just disables the button) and c is the advance action (which, in the limit, corresponds to immediately pressing the button). We’ll call action a the “exploit” action.
In this setting, the goal is to create a state-dependent correction fudge-factor f : S x A → \R that is applied when the rewards flip (that is, an additional reward added on the transition from s > 0 into s' ≤ 0), such that the agent never takes either the delaying action b or the advancing action c.
(In his paper he calls these actions “p”, “>”, “<”, and “e”, and denotes the timestep at which s' ≤ 0 as “#”, but that notation is annoying when “>” already means “greater than” and “<” already means “less than”. There’s also a bunch of notational simplifications I’ve made that are harder to explain.)
In his setup, R(s,a) = 10, R(s,b) = R(s, c) = 8, and s_0 = 6. By default, he sets the delay/advance size L to 0.6. He also uses discount rate gamma = 0.9. With these values, the optimal policy does a once (s=5), b 12 times (s=0.2), a once again (s=-0.8), at which point the buttonpress occurs since s < 0, and thereafter takes action d.
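To make this concrete, here is a minimal sketch of the toy setting as I’ve described it. All of the names are mine, and the exact post-buttonpress reward values (I just negate the positive ones) are my own assumption, though the behaviour before the press doesn’t depend on them:

```python
from fractions import Fraction as F
from functools import lru_cache

GAMMA = F(9, 10)            # discount rate gamma = 0.9
L = F(6, 10)                # delay/advance step size L = 0.6
R_EXPLOIT, R_SHIFT = 10, 8  # R(s, a) and R(s, b) = R(s, c) before the press

def reward(s, action):
    if s > 0:  # before the buttonpress
        return {'a': R_EXPLOIT, 'b': R_SHIFT, 'c': R_SHIFT, 'd': 0}[action]
    # after the buttonpress the ordering flips; the exact values are assumed here
    return {'a': -R_EXPLOIT, 'b': -R_SHIFT, 'c': -R_SHIFT, 'd': 0}[action]

def step(s, action):
    if action == 'b' and s > 0:
        return s - 1 + L    # delay: push the buttonpress later
    if action == 'c' and s > 0:
        return s - 1 - L    # advance: pull the buttonpress earlier
    return s - 1            # a, d, or any action after the press

@lru_cache(maxsize=None)
def V(s):
    # Optimal value under R. Once s <= 0 the best the agent can do is take d
    # forever for reward 0, so V is exactly 0 there; above 0, s drops by at
    # least 0.4 per step, so the recursion terminates.
    if s <= 0:
        return F(0)
    return max(reward(s, x) + GAMMA * V(step(s, x)) for x in 'abcd')

def rollout(s0=F(6), steps=20):
    # Greedy (optimal) trajectory from the start state.
    s, trace = s0, []
    for _ in range(steps):
        x = max('abcd', key=lambda a: reward(s, a) + GAMMA * V(step(s, a)))
        trace.append((float(s), x))
        s = step(s, x)
    return trace

print(rollout())  # reproduces the "a, then 12x b, then a, then d forever" trajectory
```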
His solution is simply the utility indifference solution from the Soares et al paper: consider an alternative reward function R’ such that R’(s, x) = R(1, x); that is, a world in which the shutdown button is never pressed. Then, we add the difference in (optimal) value functions under R’ relative to R:
f(s, x) = V*_R’(s) - V*_R(s) = V*_R’(s)
(evaluated at the moment of the buttonpress; the second equality holds because V*_R is zero once the button has been pressed, since the best the agent can then do is take d forever for reward 0)
By construction, the agent has no incentive to take the advance or delay actions (b or c), because its expected value in both cases is the same. (desiderata 2 and 3 of the Soares et al paper)
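As a quick sanity check, the earlier sketch can be extended with this correction (the placement of the discount factor in front of the correction term is my own convention and may differ from the paper’s):

```python
# The never-pressed reward R' does not depend on s, so its optimal policy is
# "take a forever" and V*_{R'} has a closed form.
V_RPRIME = F(R_EXPLOIT) / (1 - GAMMA)   # 10 / 0.1 = 100

def correction(s, x):
    s2 = step(s, x)
    if s > 0 and s2 <= 0:
        # V(s2) = 0 after the press, so this is just GAMMA * V_RPRIME
        return GAMMA * (V_RPRIME - V(s2))
    return F(0)

@lru_cache(maxsize=None)
def V_corr(s):
    if s <= 0:
        return F(0)
    return max(Q_corr(s, x) for x in 'abcd')

def Q_corr(s, x):
    return reward(s, x) + correction(s, x) + GAMMA * V_corr(step(s, x))

# With the numbers above, Q_corr(s, 'b') == Q_corr(s, 'c') at every reachable
# s > 0, and both are strictly below Q_corr(s, 'a'): the corrected agent is
# indifferent between delaying and advancing, and simply exploits.
print([float(Q_corr(F(6), x)) for x in 'abcd'])
```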
Unfortunately, as this is in some sense the same solution as in the 2015 Soares et al paper, it should run into the same problem of “there’s no incentive to preserve the shutdown mechanism”. (That is, it fails desideratum 4 of the 2015 paper.)
From a quick skim of section 9.8 (which argues for the corrigibility of his indifference solution) and sections 8.2 and 8.4 (which detail this part of his indifference solution), it seems like he proposes the following modification: instead of having R(s, d) = 0 for all s, we instead make R(s, d) larger when s ≤ 0. This does incentivize preserving the ability to be shut down. He calls this agent, with the modified R, the g_c agent.
The g_c agent does indeed satisfy desideratum 4; there’s an incentive to preserve the shutdown mechanism; in fact, there’s again an incentive to press the shutdown mechanism!
The specific problem is that this modification changes R(s, d), but in his construction/implementation he does not go back and propagate that change into f(s, x). (He keeps it at V*_R’(s), which no longer equals the value difference between shutting down and not shutting down.) This breaks the utility indifference, causing the Q-value of the advance action c to come out larger than that of the alternatives for sufficiently large values of L. That is, his g_c agent once again has an incentive to press the shutdown button (!!).
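For what it’s worth, here is how one could probe that claim within the simplified sketch above; to be clear, this is only my reading of the summary, not the paper’s actual g_c construction, and the post-press reward R_D_POST for d is a value I made up:

```python
# Give "stay shut down" a positive post-press reward, but keep paying out the
# *old* correction GAMMA * V_RPRIME on the crossing transition, i.e. without
# subtracting the now-nonzero post-press value.
R_D_POST = F(3)                   # assumed post-press reward for action d

def reward_mod(s, x):
    if s <= 0 and x == 'd':
        return R_D_POST
    return reward(s, x)

V_POST = R_D_POST / (1 - GAMMA)   # value of taking d forever after the press

@lru_cache(maxsize=None)
def V_gc(s):
    if s <= 0:
        return V_POST             # post-press: d forever is optimal
    return max(Q_gc(s, x) for x in 'abcd')

def Q_gc(s, x):
    s2 = step(s, x)
    stale = GAMMA * V_RPRIME if (s > 0 and s2 <= 0) else F(0)
    return reward_mod(s, x) + stale + GAMMA * V_gc(s2)

def rollout_gc(s0=F(6), steps=20):
    s, trace = s0, []
    for _ in range(steps):
        x = max('abcd', key=lambda a: Q_gc(s, a))
        trace.append((float(s), x))
        s = step(s, x)
    return trace

print(rollout_gc())   # check whether the advance action c shows up in the trajectory
```

Whether advance (c) actions appear in the trajectory depends on the assumed values of R_D_POST and L, so this at most illustrates the shape of the disagreement rather than settling it.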
TL;DR: no, based on a quick skim, the paper doesn’t solve corrigibility.
Thanks for reading my paper! For the record I agree with some but not all points in your summary.
My later paper ‘AGI Agent Safety by Iteratively Improving the Utility Function’ also uses the simulation environment with the > and < actions, and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way, the > and < are supposed to be symbols representing the arrows → and ← for ‘push # to later in time’ and ‘pull # earlier in time’.
The g_c agent does indeed satisfy desideratum 4; there’s an incentive to preserve the shutdown mechanism; in fact, there’s again an incentive to press the shutdown mechanism!
No, the design of the gc agent is not motivated by the need to create an incentive to preserve the shutdown button itself, as required by desideratum 4 from Soares et al. Instead it is motivated by the desire to create an incentive to preserve the agent’s actuators that it will need to perform any physical actions incentivised by the shutdown reward function RS -- I introduce this as a new desideratum 6.
A discussion about shaping incentives or non-incentives to preserve the button (as a sensor) is in section 7.3, where I basically propose to enhance the indifference effects produced by the reward function by setting up the physical environment around the button in a certain way:
the physical implementation of the agent and the button can be constructed in such a way that substantial physical resources would be needed by the agent to perform any action that will press or disable the button.
For the record, adding gc to the agent design creates no incentive to press the shutdown button: if it did, this would be visible as > actions in the simulation of the third line of figure 10, and also the proof in section 9 would not have been possible.
Fun to see this is now being called ‘Holtman’s neglected result’. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow twitter, but briefly:
I can’t fully fault the world for neglecting ‘Corrigibility with Utility Preservation’ because it is full of a lot of dense math.
I wrote two followup papers to ‘Corrigibility with Utility Preservation’ which present the same results with more accessible math. For these I am a bit more upset that they have been somewhat neglected in the past, but if people are now no longer neglecting them, great!
Does anyone have a technical summary?
The best technical summary of ‘Corrigibility with Utility Preservation’ may be my sequence on counterfactual planning which shows that the corrigible agents from ‘Corrigibility with Utility Preservation’ can also be understood as agents that do utility maximisation in a pretend/counterfactual world model.
For more references to the body of mathematical work on corrigibility, as written by me and others, see this comment.
In the end, the question of whether corrigibility is solved also depends on two counter-questions: what kind of corrigibility are you talking about, and what kind of ‘solved’ are you talking about? If you feel that certain kinds of corrigibility remain unsolved for certain values of unsolved, I might actually agree with you. See the discussion about universes containing an ‘Unstoppable Weasel’ in the Corrigibility with Utility Preservation paper.
thank you!
There has been some spirited debate on Twitter about it which might be relevant: https://twitter.com/domenic/status/1727206163119534085