I’m replying to the quote from the first comment:
For sufficiently general and long-term preferences, it’s not clear that “do what we mean” is sufficient either. None of us knows what we want, so what we mean isn’t even very good evidence of what we want.
Just don’t ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
Like, if someone tells me that they want me to protect nature, I know that in effect they mean “Take actions to protect nature right now, but don’t do anything super drastic that would conflict with other things I care about, and if I change my mind in the future, defer to that change, etc.” I think a good “do what you mean” system would capture all of that. This isn’t implied by my definition, of course, but I think that a system where the specification is latent and uncertain could have this property.
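One way to picture a latent, uncertain specification is as a posterior over candidate readings of the instruction: the agent acts well in expectation, vetoes anything catastrophic under any reading it still takes seriously (“nothing super drastic”), and downweights a reading when the user changes their mind. A minimal toy sketch, with all names and numbers hypothetical:

```python
# Toy sketch of a "latent, uncertain specification" (hypothetical example).
# The agent never commits to one reading of "protect nature"; it keeps a
# posterior over candidate objectives, picks the action best in expectation,
# and refuses actions that are disastrous under any still-plausible reading.

CANDIDATE_OBJECTIVES = {
    # hypothesis -> {action: value under that reading of the instruction}
    "literal":  {"plant_trees": 1.0, "ban_all_industry": 2.0, "do_nothing": 0.0},
    "intended": {"plant_trees": 1.0, "ban_all_industry": -5.0, "do_nothing": 0.0},
}

def choose_action(posterior, veto_threshold=-1.0):
    """Best expected action, skipping any action that is catastrophic
    under a hypothesis with non-negligible posterior weight."""
    actions = next(iter(CANDIDATE_OBJECTIVES.values())).keys()

    def safe(action):
        # "Nothing super drastic": no live hypothesis rates it below threshold.
        return all(values[action] >= veto_threshold
                   for hyp, values in CANDIDATE_OBJECTIVES.items()
                   if posterior[hyp] > 0.05)

    def expected(action):
        return sum(posterior[hyp] * CANDIDATE_OBJECTIVES[hyp][action]
                   for hyp in posterior)

    return max((a for a in actions if safe(a)), key=expected)

def update_on_feedback(posterior, disfavored_hypothesis, factor=0.1):
    """Defer to a change of mind: downweight the reading the user rejected
    and renormalize."""
    new = dict(posterior)
    new[disfavored_hypothesis] *= factor
    total = sum(new.values())
    return {hyp: p / total for hyp, p in new.items()}
```

With equal weight on both readings, the agent plants trees rather than banning all industry, because the drastic action is vetoed by the “intended” reading; only if the user repeatedly confirms the literal reading (pushing “intended” below the plausibility cutoff) does the drastic action become eligible.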
I believe that reduces to “solve the Friendly AI problem”.
(Pedantic note: the right way to say that is “the Friendly AI problem reduces to that”.)
What I’m trying to say is that once you have a “do what we mean” system, then don’t explicitly ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
I claim that the hard part there is in building a “do what we mean” system, not in the “don’t explicitly ask for a bad thing” part.