I’m curious if you have thoughts about Eliezer’s scattered arguments on the dangerousness of optimization in the recent very-long-chats (first one). It seems to me that one relevant argument I can pin onto him is something like “well, but have you imagined what it would be like if this supposedly-benign optimization actually solved hard problems?”
Like, it’s easy to say the words “A giant look-up table that has the same input-output function as a human, on the data it actually receives.” But just saying the words isn’t imagining what this thing would actually be like. First, such a look-up table would be super-universally vast. But I think that’s not even the most important thing to think about when imagining it. The more important question is “how did this thing get made, since we’re not allowed to just postulate it into existence?” I interpret Eliezer as arguing that if you have to somehow make a giant look-up table that has the input-output behavior of a powerful optimizer on some dataset, then practically speaking you’re going to end up with something that is also a powerful optimizer in many other domains, not something that safely draws from a uniform distribution over off-distribution behaviors.
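To make the thought experiment a bit more concrete, here’s a toy sketch (my own illustration, not anything from the conversations; all names are made up): a literal look-up table only has entries for the inputs it was built from, and those entries have to come from whatever powerful process generated them.

```python
# Toy illustration only: a literal look-up table is just a dict whose entries
# were produced by some powerful process. The table says nothing about inputs
# it was never built for; the interesting object is whatever filled it in.

def build_lookup_table(powerful_optimizer, training_inputs):
    # The table doesn't appear from nowhere -- it is generated *by* the optimizer.
    return {x: powerful_optimizer(x) for x in training_inputs}

def lookup(table, x):
    # On the data it was built from, the table matches the optimizer exactly.
    # Off-distribution, it simply has no entry at all.
    return table.get(x)
```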
My initial thought is that I don’t see why this powerful optimizer would attempt to optimize things in the world, rather than just do some search thing internally.
I agree with your framing of “how did this thing get made, since we’re not allowed to just postulate it into existence?”. I can imagine a language model which manages to output words that cause strokes in whoever reads its outputs, but I think you’d need a pretty strong case for why something like that would actually be produced in practice by the training process.
Say you have some powerful-optimizer language model which answers questions. If you ask a question which is off its training distribution, I would expect it to either answer the question really well (i.e. it generalises properly), or kind of break and answer the question badly. I don’t expect it to break in such a way that it suddenly decides to optimize for things in the real world. That would seem like a very strange jump to make, from ‘answer questions well’ to ‘attempt to change the state of the world according to some goal’.
But if we trained the LM on ‘have a good ongoing conversation with a human’, such that the model was trained with reward over time and its behaviour would affect its inputs (because it’s a conversation), then I think it might do dangerous optimization, because it was already performing optimization to affect the state of the world. In that case a distributional shift could cause this goal-directed optimization to be ‘pointed in the wrong direction’, or uncover places where the human’s and the AI’s goals come apart (even though they were aligned on the training distribution).
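To make the structural difference between the two setups concrete, here’s a rough sketch (all names and interfaces are made up for illustration, not any real training code): in the single-turn case each example is independent, whereas in the conversational case the model’s own replies feed back into its future inputs and reward accumulates over time.

```python
# Rough sketch of the two training setups discussed above. The objects
# (model, human, reward_fn, optimizer) are assumed interfaces, not a real API.

def train_single_turn_qa(model, qa_pairs, loss_fn, optimizer):
    # Supervised question answering: each example is independent,
    # and the model's output never influences what it sees next.
    for question, reference_answer in qa_pairs:
        answer = model.generate(question)
        optimizer.step(loss_fn(answer, reference_answer))

def train_conversation_episode(model, human, reward_fn, optimizer, num_turns):
    # Conversational training with reward over time: the model's reply is
    # appended to the history, so its behaviour shapes its own future inputs.
    history = []
    total_reward = 0.0
    for _ in range(num_turns):
        reply = model.generate(history)
        history.append(("model", reply))
        human_msg = human.respond(history)   # depends on the model's reply
        history.append(("human", human_msg))
        total_reward += reward_fn(history)   # reward accrues across the episode
    optimizer.step(-total_reward)            # update to maximise cumulative reward
```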