RE: “Imitation learning considered unsafe?” (I’m the author):
The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.
I agree with your response; this is also why I said: “Mistakes in imitating the human may be relatively harmless; the approximation may be good enough”.
I don’t agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).
I don’t agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).
Thanks for the clarification. Consider the sort of relatively simple, super-human planning algorithm that, for most goals, would lead the planner/agent to take over the world or do similarly elaborate and impactful things in the service of whatever goal is being pursued. A Bayesian predictor of the human’s behavior will consider the hypothesis H_g that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form H_g. A hypothesis which says that the observed behavior is the output of human-like planning in the service of some goal which is slightly incorrect may maintain some weight in the posterior after a number of observations, but I don’t see how “dangerously powerful planning + goal” remains under consideration.
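To make that concrete, here is a toy numerical sketch of the elimination argument; the hypothesis set, priors, and likelihood values below are invented purely for illustration and aren’t meant to model any real learner:

```python
# Toy sketch of the elimination argument above. Hypotheses, priors, and
# likelihoods are made up for illustration only.

# Prior over hypotheses about the process generating the human's behavior.
prior = {
    "superhuman_planner_goal_A": 0.25,            # an H_g: powerful planning toward goal A
    "superhuman_planner_goal_B": 0.25,            # another H_g
    "humanlike_planner_correct_goal": 0.25,
    "humanlike_planner_slightly_wrong_goal": 0.25,
}

# Probability each hypothesis assigns to one observed, ordinary human action.
# The H_g hypotheses predict takeover-style behavior instead, so they assign
# the ordinary action almost no probability.
likelihood = {
    "superhuman_planner_goal_A": 1e-6,
    "superhuman_planner_goal_B": 1e-6,
    "humanlike_planner_correct_goal": 0.5,
    "humanlike_planner_slightly_wrong_goal": 0.4,
}

def bayes_update(posterior, lik):
    """One observation: multiply by the likelihood, then renormalize."""
    unnormalized = {h: p * lik[h] for h, p in posterior.items()}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

posterior = dict(prior)
for _ in range(10):          # ten ordinary observations of the human
    posterior = bayes_update(posterior, likelihood)

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.3g}")
# The superhuman-planner hypotheses drop to negligible weight almost
# immediately, while the two human-like hypotheses keep competing.
```

Nothing here bears on which hypotheses a real learner actually represents; it just illustrates why, under a Bayesian update, “dangerously powerful planning + goal” can’t retain posterior weight once its predictions disagree with the observed behavior.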
The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.
I suppose the point of human imitation is to produce a weak, conservative, lazy, impact-sensitive mesa-optimizer, since humans are optimizers with exactly those qualities. If it weren’t producing a mesa-optimizer, something would have gone very wrong. So this is a good point. As for whether this is dangerous, I think the discussion above is the place to focus.
A Bayesian predictor of the human’s behavior will consider the hypothesis H_g that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form H_g.
This is a very good argument, and I’m still trying to decide how decisive I think it is.
In the meantime, I’ll mention that I’m imagining the learner as something closer to a DNN than a Bayesian predictor. One picture of how DNN learning often proceeds is as a series of “aha” moments (generating/revising highly general explanations of the data) interspersed with something more like memorization of the data points that don’t fit the current general explanations. On that view it seems plausible that “planning” would emerge as an “aha” moment before being refined into “oh wait, bounded planning… with these heuristics… and these restrictions…”, creating a dangerous window of time between “I’m doing planning” and “I’m planning like a human, warts and all”.