RE: “Imitation learning considered unsafe?” (I’m the author):
The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.
I agree with your response; this is also why I said: “Mistakes in imitating the human may be relatively harmless; the approximation may be good enough”.
I don’t agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).
I don’t agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).
Thanks for the clarification. Consider the sort of relatively simple, super-human planning algorithm that, for most goals, would lead the planner/agent to take over the world or do similarly elaborate and impactful things in the service of whatever goal is being pursued. A Bayesian predictor of the human’s behavior will consider the hypothesis H_g that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form H_g. A hypothesis which says that the observed behavior is the output of human-like planning in the service of some goal which is slightly incorrect may maintain some weight in the posterior after a number of observations, but I don’t see how “dangerously powerful planning + goal” remains under consideration.
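To make that concrete, here is a toy numerical sketch of the elimination argument; the hypothesis set, priors, and likelihood values below are invented purely for illustration and aren’t meant to model any real learner:

```python
# Toy sketch of the elimination argument above. Hypotheses, priors, and
# likelihoods are made up for illustration only.

# Prior over hypotheses about the process generating the human's behavior.
prior = {
    "superhuman_planner_goal_A": 0.25,            # an H_g: powerful planning toward goal A
    "superhuman_planner_goal_B": 0.25,            # another H_g
    "humanlike_planner_correct_goal": 0.25,
    "humanlike_planner_slightly_wrong_goal": 0.25,
}

# Probability each hypothesis assigns to one observed, ordinary human action.
# The H_g hypotheses predict takeover-style behavior instead, so they assign
# the ordinary action almost no probability.
likelihood = {
    "superhuman_planner_goal_A": 1e-6,
    "superhuman_planner_goal_B": 1e-6,
    "humanlike_planner_correct_goal": 0.5,
    "humanlike_planner_slightly_wrong_goal": 0.4,
}

def bayes_update(posterior, lik):
    """One observation: multiply by the likelihood, then renormalize."""
    unnormalized = {h: p * lik[h] for h, p in posterior.items()}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

posterior = dict(prior)
for _ in range(10):          # ten ordinary observations of the human
    posterior = bayes_update(posterior, likelihood)

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.3g}")
# The superhuman-planner hypotheses drop to negligible weight almost
# immediately, while the two human-like hypotheses keep competing.
```

Nothing here bears on which hypotheses a real learner actually represents; it just illustrates why, under a Bayesian update, “dangerously powerful planning + goal” can’t retain posterior weight once its predictions disagree with the observed behavior.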
The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.
I suppose the point of human imitation is to produce a weak, conservative, lazy, impact-sensitive mesa-optimizer, since humans are optimizers with exactly those qualities. If it weren’t producing a mesa-optimizer, something would have gone very wrong. So this is a good point. As for whether this is dangerous, I think the discussion above is the place to focus.
A Bayesian predictor of the human’s behavior will consider the hypothesis H_g that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form H_g.
This is a very good argument, and I’m still trying to decide how decisive I think it is.
In the meantime, I’ll mention that I’m imagining the learner as something closer to a DNN than a Bayesian predictor. One picture of how DNN learning often proceeds is as a series of “aha” moments (generating/revising highly general explanations of the data) interspersed with something more like memorization of the data points that don’t fit the current general explanations. On that view it seems plausible that “planning” would emerge as an “aha” moment before being refined into “oh wait, bounded planning… with these heuristics… and these restrictions…”, creating a dangerous window of time between “I’m doing planning” and “I’m planning like a human, warts and all”.