I’m all for it! See my post here advocating for research in that direction. I don’t think there’s any known fundamental problem, just that we need to figure out how to do it :-)
For example, with end-to-end training, it’s hard to distinguish the desired “optimize for X then print your plan to the screen” from the super-dangerous “optimize the probability that the human operators think they are looking at a plan for X”. (This is probably the kind of inner alignment problem that ofer is referring to.)
I proposed here that maybe we can make this kind of decoupled system with self-supervised learning, although there are still many open questions about that approach, including the possibility that it’s less safe than it first appears.
Incidentally, I like the idea of mixing Decoupled AI 1 and Decoupled AI 3 to get:
Decoupled AI 5: “Consider the (counterfactual) Earth with no AGIs, and figure out the most probable scenario in which a small group (achieves world peace / cures cancer / whatever), and then describe that scenario.”
I think this one would be likelier to give a reasonable, human-compatible plan on the first try (though you should still ask follow-up questions before actually doing it!).