(Pedantic note: the right way to say that is “the Friendly AI problem reduces to that”.)
I’m replying to the quote from the first comment:
For sufficiently general and long-term preferences, it’s not clear that “do what we mean” is sufficient either. None of us knows what we want, so what we mean isn’t even very good evidence of what we want.
What I’m trying to say is that once you have a “do what we mean” system, you shouldn’t explicitly ask your AI system to optimize for general and long-term preferences without a way to say “actually, stop, I changed my mind”.
I claim that the hard part there is in building a “do what we mean” system, not in the “don’t explicitly ask for a bad thing” part.