A Bayesian predictor of the human’s behavior will consider the hypothesis Hg that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form Hg.
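The elimination mechanism can be sketched numerically. This is my own toy illustration (the hypothesis names and numbers are made up, not from the original argument): any hypothesis Hg that assigns near-zero likelihood to the observed mundane behavior has its posterior driven to near-zero by a single Bayes update.

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation.

    prior: dict hypothesis -> prior probability
    likelihoods: dict hypothesis -> P(observation | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

# Hypothetical hypotheses: H_money / H_power say the human plans
# unboundedly in service of that goal (and so would, e.g., seek power);
# H_human says the human does bounded, heuristic planning.
prior = {"H_money": 0.3, "H_power": 0.3, "H_human": 0.4}

# Observation: an ordinary, non-power-seeking action. An unbounded
# optimizer for g would almost never act this way, so H_g assigns it
# negligible likelihood.
likelihoods = {"H_money": 1e-6, "H_power": 1e-6, "H_human": 0.5}

posterior = bayes_update(prior, likelihoods)
# posterior concentrates almost entirely on H_human
```

The point is only that the update is sharp: one observation inconsistent with unbounded goal-pursuit effectively rules out every Hg of that form.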
This is a very good argument, and I’m still trying to decide how decisive I think it is.
In the meanwhile, I’ll mention that I’m imagining the learner as something closer to a DNN than a Bayesian predictor. One picture of how DNN learning often proceeds is as a series of “aha” moments (generating/revising highly general explanations of the data) interspersed with something more like memorization of data-points that don’t fit the current general explanations. That view makes it seem plausible that “planning” would emerge as an “aha” moment before being refined as “oh wait, bounded planning… with these heuristics… and these restrictions…”, creating a dangerous window of time between “I’m doing planning” and “I’m planning like a human, warts and all”.