There’s another layer of uncertainty here. For sufficiently general and long-term preferences, it’s not clear that “do what we mean” is sufficient either. None of us knows what we want, so what we mean isn’t even very good evidence of what we want.
“do what I would want to mean” is closer, but figuring out the counterfactuals for “would” that preserve “I” is not easy.
Agreed. Humans don’t really have utility functions. We might try to get around this by having the AI learn how humans would like to be interpreted as having a utility function, and how they would like that to be interpreted, and so on in an infinite tower of reflection, but that doesn’t seem very practical or desirable.
I think there was an old Wei Dai post on “artificial philosophy” that was about this problem? The idea is we want the AI to collapse this infinite tower by learning the philosophical considerations that generate it, then use that knowledge to learn its preferences from humans.
Just don’t ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
Like, if someone tells me that they want me to protect nature, I know that in effect they mean “Take actions to protect nature right now, but don’t do anything super drastic that would conflict with other things I care about, and if I change my mind in the future, defer to that change, etc.” I think a good “do what you mean” system would capture all of that. This isn’t implied by my definition, of course, but I think that a system where the specification is latent and uncertain could have this property.
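(To gesture at what I mean by “latent and uncertain”: here’s a toy Python sketch, with completely made-up interpretations, payoffs, and update rule, of an agent that keeps a distribution over readings of “protect nature”, refuses actions that are disastrous under any reading it still takes seriously, and treats “actually, stop, I changed my mind” as evidence about which reading was intended. It’s an illustration of the shape of the idea, not a proposal or anyone’s actual method.)

```python
# Toy sketch of a latent, uncertain specification. All names and numbers are
# invented for illustration.

# Candidate interpretations of "protect nature", with prior weights.
beliefs = {
    "plant_trees_locally":        0.5,
    "lobby_for_conservation":     0.4,
    "convert_all_land_to_forest": 0.1,  # drastic reading the human almost surely doesn't mean
}

# Hypothetical payoff each interpretation assigns to each available action.
payoffs = {
    "plant_a_tree":        {"plant_trees_locally": 1.0,  "lobby_for_conservation": 0.3,  "convert_all_land_to_forest": 0.2},
    "write_to_legislator": {"plant_trees_locally": 0.2,  "lobby_for_conservation": 1.0,  "convert_all_land_to_forest": 0.1},
    "bulldoze_the_city":   {"plant_trees_locally": -5.0, "lobby_for_conservation": -5.0, "convert_all_land_to_forest": 2.0},
}

def choose_action(beliefs):
    """Best expected payoff, but refuse anything catastrophic under a reading
    that still has non-trivial probability."""
    def expected(action):
        return sum(p * payoffs[action][i] for i, p in beliefs.items())
    def safe(action):
        return all(payoffs[action][i] > -1.0 for i, p in beliefs.items() if p > 0.05)
    return max((a for a in payoffs if safe(a)), key=expected)

def update_on_correction(beliefs, disliked_action):
    """'Actually, stop, I changed my mind' about an action counts as evidence
    against the readings that favored that action."""
    new = {i: p * (0.1 if payoffs[disliked_action][i] > 0.5 else 1.0)
           for i, p in beliefs.items()}
    total = sum(new.values())
    return {i: p / total for i, p in new.items()}

action = choose_action(beliefs)
print("chosen:", action)                       # never the drastic option
beliefs = update_on_correction(beliefs, action)
print("after correction:", beliefs)            # mass shifts away from that reading
```

The point is just that deference and “don’t do anything drastic” fall out of maintaining uncertainty over the specification, rather than being bolted on as extra rules.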
I believe that reduces to “solve the Friendly AI problem”.

(Pedantic note: the right way to say that is “the Friendly AI problem reduces to that”.)
I’m replying to the quote from the first comment:
For sufficiently general and long-term preferences, it’s not clear that “do what we mean” is sufficient either. None of us knows what we want, so what we mean isn’t even very good evidence of what we want.
What I’m trying to say is that once you have a “do what we mean” system, you shouldn’t explicitly ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
I claim that the hard part there is in building a “do what we mean” system, not in the “don’t explicitly ask for a bad thing” part.