Daniel Kokotajlo comments on AI will change the world, but won’t take it over by playing “3-dimensional chess”.

Daniel Kokotajlo 26 Nov 2022 10:39 UTC
2 points
That sounds very different to me from “the current paradigm is almost in direct contradiction to long-term goals.” Maybe we agree after all. Possible remaining differences between us: Are you imagining zero end-to-end training of the system, or just a small amount? For example, consider WebGPT and OpenAI’s more recent Minecraft agent. They were trained for many steps with unsupervised learning and then fine-tuned for a bit with end-to-end RL, if I recall correctly. Are you saying that insofar as AI operates autonomously for more than, say, 100,000 serial forward passes, it’ll involve zero end-to-end training? If so, then I’d disagree and say it’ll probably involve some.

Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
- paulfchristiano 3 Dec 2022 19:31 UTC
  2 points
  Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
  My high-level take on this disagreement:
  If you train a system to accomplish tasks by taking a large number of small human-like steps, and then you finetune the system “a bit” based on actually achieving the goal, I think you will mostly get systems that pursue goals by taking human-like short-term steps but do so more competently. They will do things like avoiding strategies that don’t work in practice for them, and having better intuitions about how to select amongst many possible options; they won’t do things like carrying out long-term plans that aren’t made out of human-legible short-term steps.
  I think that:
  1. If these systems already understand a lot of human-illegible stuff, they might quickly start using it when fine-tuned and that may not always result in changes to legible plans.
  2. If these systems start taking plans that achieve the goals in ways that humans don’t like, and then we fine-tune based on individual plan steps (while continuing to select for success), then you are likely to train the system to obfuscate its scheme.
  3. If you train long enough end-to-end, you will get novel behaviors that can be scarier, although I think the computational cost of doing so may be very large.
  And overall I think there are enough threat models that we should be worried, and should try to develop machinery so that we don’t need to do the kind of training that could result in doom. But I also think the most likely scenario is more along the lines of what the OP is imagining, and we can stay significantly safer by e.g. having consensus at ML labs that #2 is likely to be scary and should be considered unacceptable. Ultimately what’s most important is probably understanding how to determine empirically which world you are in.