I’m all for it! See my post here advocating for research in that direction. I don’t think there’s any known fundamental problem, just that we need to figure out how to do it :-)
For example, with end-to-end training, it’s hard to distinguish the desired “optimize for X then print your plan to the screen” from the super-dangerous “optimize the probability that the human operators think they are looking at a plan for X”. (This is probably the kind of inner alignment problem that ofer is referring to.)
I proposed here that maybe we can make this kind of decoupled system with self-supervised learning, although there are still many open questions about that approach, including the possibility that it’s less safe than it first appears.
Incidentally, I like the idea of mixing Decoupled AI 1 and Decoupled AI 3 to get:
Decoupled AI 5: “Consider the (counterfactual) Earth with no AGIs, and figure out the most probable scenario in which a small group (achieves world peace / cures cancer / whatever), and then describe that scenario.”
I think this one would be likelier to give a reasonable, human-compatible plan on the first try (though you should still ask follow-up questions before actually doing it!).