I sometimes name your work in conversation as an example of good recent agent foundations work, based on having read some of it, skimmed the rest, and talked to you a little about it at EAG. It’s on my todo list to work through it properly, and I expect to actually do it because it’s the blocker on me rewriting and posting my “why the shutdown problem is hard” draft, which I really want to post.
The reasons I’m a priori not extremely excited are that it seems intuitively very difficult to avoid both of these issues:
I’d be surprised if an agent with (very) incomplete preferences was real-world competent. I think it’s easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
It’s easy to shuffle around the difficulty of the shutdown problem, e.g. by putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
It’s plausible you’ve avoided these problems but I haven’t read deeply enough to know yet. I think it’s easy for issues like this to be hidden (accidentally), so it’ll take a lot of effort for me to read properly (but I will, hopefully in about a week).
The part where it works for a prosaic setup seems wrong because of inner alignment issues (although I see you cited my post in a footnote about this, thanks!), but this isn’t what the shutdown problem is about, so it isn’t an issue if it doesn’t apply directly to prosaic setups.
it’ll take a lot of effort for me to read properly (but I will, hopefully in about a week).
Nice, interested to hear what you think!
I think it’s easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
Yep agree that this is a concern, and I plan to think more about this soon.
putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
Interested to hear more about this. I’m not sure exactly what you mean by ‘detector’, but I don’t think my proposal requires either of these. The agent won’t try to manipulate the button, because doing so is timestep-dominated by not doing so. And the agent won’t self-modify in ways that stop it being shutdownable, again because doing so is timestep-dominated by not doing so. I don’t think we need a detector in either case.
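To make ‘timestep-dominated’ concrete, here’s a rough toy sketch (my own simplification, with made-up numbers, rather than the full proposal): summarize each option by its expected utility conditional on shutdown occurring at each timestep, and call an option timestep-dominated if some alternative is at least as good conditional on every shutdown timestep and strictly better conditional on at least one.

```python
# Toy sketch of the Timestep Dominance comparison (illustrative only).
# Each option maps a possible shutdown timestep to the option's expected
# utility conditional on shutdown at that timestep. The probabilities of
# the timesteps themselves are deliberately left out: the comparison is
# made conditional on each shutdown time.

def timestep_dominates(x, y):
    """True iff x is at least as good as y conditional on every shutdown
    timestep, and strictly better conditional on at least one.
    x, y: dicts mapping shutdown timestep -> conditional expected utility,
    assumed to cover the same set of timesteps."""
    assert x.keys() == y.keys()
    at_least_as_good = all(x[t] >= y[t] for t in x)
    strictly_better = any(x[t] > y[t] for t in x)
    return at_least_as_good and strictly_better

# Manipulating the button costs resources, so conditional on every shutdown
# timestep it yields a little less utility than leaving the button alone,
# even if it shifts probability towards longer trajectories.
leave_alone = {1: 5.0, 2: 8.0, 3: 10.0}
manipulate  = {1: 4.5, 2: 7.5, 3: 9.5}

assert timestep_dominates(leave_alone, manipulate)
# A TD-satisfying agent never chooses a timestep-dominated option,
# so it won't pay the cost of manipulating the button.
```

The important feature is that the probabilities of the shutdown timesteps don’t enter the comparison, so shifting them around (by manipulating the button) can’t compensate for the resource cost.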
because of inner alignment issues
I argue that my proposed training regimen largely circumvents the problems of goal misgeneralization and deceptive alignment. On goal misgeneralization, POST and TD seem simple. On deceptive alignment, agents trained to satisfy POST seem never to get the chance to learn to prefer any longer trajectory to any shorter trajectory. And if the agent doesn’t prefer any longer trajectory to any shorter trajectory, it has no incentive to act deceptively to avoid being made to satisfy TD.
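To spell out the intended behaviour (again a toy sketch of my own, not the actual training setup): an agent satisfying POST has preferences between same-length trajectories but lacks a preference between trajectories of different lengths, which I model here as choosing stochastically among lengths while maximizing utility within a length.

```python
import random

# Toy sketch of POST-satisfying choice behaviour (illustrative only):
# no preference between trajectories of different lengths (modelled here as
# picking a length at random), but ordinary maximization among trajectories
# of the same length.

def post_choose(options):
    """options: list of (trajectory_length, utility) pairs."""
    lengths = sorted({length for length, _ in options})
    chosen_length = random.choice(lengths)       # no preference over lengths
    same_length = [o for o in options if o[0] == chosen_length]
    return max(same_length, key=lambda o: o[1])  # maximize within a length

# Such an agent never pays a utility cost to lengthen or shorten its
# trajectory, so it lacks an incentive to resist or hasten shutdown.
print(post_choose([(2, 3.0), (2, 5.0), (4, 4.0), (4, 7.0)]))
```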
this isn’t what the shutdown problem is about so it isn’t an issue if it doesn’t apply directly to prosaic setups
I’m confused about this. Why isn’t it an issue if some proposed solution to the shutdown problem doesn’t apply directly to prosaic setups? Ultimately, we want to implement some proposed solution, and it seems like an issue if we can’t see any way to do that using current techniques.
Fair enough, though the post itself does claim prosaic applications.