My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.
Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models. But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.
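To pin down the distinction this hinges on, here is a minimal toy sketch (my own illustration, not anything from the post or the comment; the grid world, function names, and numbers are all made up): an agent whose source code contains an explicit planning affordance that searches over a world model at runtime, versus a model-free policy that is a single forward pass, where any "planning" would have to arise inside the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D grid world: the agent wants to reach GOAL.
N_STATES, GOAL, HORIZON = 10, 9, 4
ACTIONS = (-1, +1)

def world_model(state, action):
    # Stand-in for a learned transition model; here it is just the true dynamics.
    return int(np.clip(state + action, 0, N_STATES - 1))

def reward(state):
    return 1.0 if state == GOAL else 0.0

def act_with_planning_affordance(state, horizon=HORIZON):
    """Planning written into the source code: exhaustively roll out every
    action sequence through the world model and return the first action of
    the best sequence."""
    best_return, best_first = -np.inf, ACTIONS[0]
    for seq in np.ndindex(*([len(ACTIONS)] * horizon)):
        s, ret = state, 0.0
        for i in seq:
            s = world_model(s, ACTIONS[i])
            ret += reward(s)
        if ret > best_return:
            best_return, best_first = ret, ACTIONS[seq[0]]
    return best_first

def act_model_free(state, weights):
    """No planning affordance: one forward pass through (toy) learned weights.
    Any planning-like behaviour would have to be implemented implicitly
    inside `weights` by the training process."""
    logits = weights @ np.eye(N_STATES)[state]
    return ACTIONS[int(np.argmax(logits))]

print(act_with_planning_affordance(state=6))   # +1: search over the model finds the goal
print(act_model_free(state=6, weights=rng.normal(size=(len(ACTIONS), N_STATES))))  # arbitrary for untrained weights
```

Both functions map a state to an action, but only the first one searches at runtime; the second can only behave as if it plans to the extent training has baked that into its weights.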
Maybe what you’re thinking is: “Maybe the learned planning algorithm will have some weird and dangerous goal”. My hunch is: (1) if the original RL agent lacks an affordance for planning in the human-written source code, then it won’t work very well, and in particular, it won’t be up to the task of building a sophisticated dangerous planner with a misaligned goal; (2) if the original RL agent has an affordance for planning in the human-written source code, then it could make a dangerous misaligned planner, but it would be a “mistake” analogous to how future humans might unintentionally make misaligned AGIs, and this problem might be solvable by making the AI read about the alignment problem and murphyjitsu and red-teaming etc., and cranking up its risk-aversion etc.
Sorry if I’m misunderstanding. RL² stuff has never made much sense to me.
> My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.
Hm, I don’t think this quite captures what I view the post as saying.
> Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models.
Insofar as there is a safety-related claim in the post, this captures it much better than the previous quote.
> But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.
I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that’s a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1).
I can also imagine a middle ground between our hunches that looks something like “We gave our agent a pretty strong inductive bias toward learning a planning algorithm, and although we didn’t force it to learn one, it did.”
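A toy sketch of what that middle ground could look like (again, purely my own illustration; none of the names or numbers come from the post): an architecture whose forward pass is wired as a few value-iteration-style backups over learned "transition" and "reward" parameters. The wiring is a strong inductive bias toward planning, but since T and r are free parameters, training is not actually forced to make them track the environment or to use the rollout as planning.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, K = 10, 2, 5     # K = number of built-in rollout steps

# Learned parameters (random stand-ins here). The architecture wires them
# into a K-step value-iteration-style rollout -- a strong nudge toward
# planning -- but nothing constrains T and r to match the real environment,
# so training could also fill them with arbitrary non-planning computation.
T = rng.normal(size=(N_ACTIONS, N_STATES, N_STATES))   # "transition" logits
r = rng.normal(size=N_STATES)                          # "reward" estimates

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def policy(state):
    """K value-iteration-like backups over the learned T and r, then act greedily."""
    P = softmax(T, axis=-1)            # (A, S, S) row-stochastic "dynamics"
    v = r.copy()                       # (S,)   initial value estimates
    q = r + P @ v                      # (A, S) backed-up action values
    for _ in range(K - 1):
        v = q.max(axis=0)
        q = r + P @ v
    return int(q[:, state].argmax())   # greedy action at the current state

print(policy(state=3))
```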
> One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system.
See Section 3 here for why I think it would be a lot worse.
Thanks!