so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest.
That’s actually what I’m aiming at with the research agenda, but the Occam’s razor argument shows that this itself is highly non-trivial, and we need some strong grounding of the definition of preference.
There’s a difference between “creating an explicit preference learning system” and “having a generally capable system learn preferences”. I think the former is difficult (because of the Occam’s razor argument) but the latter is not.
Suppose I told you that we built a superintelligent AI system without thinking at all about grounded human preferences. Do you think that AI system doesn’t “know” what humans would want it to do, even if it doesn’t optimize for it? (See also this failed utopia story.)
Do you think that AI system doesn’t “know” what humans would want, even if it doesn’t optimize for it?
I think the AI would not know that, because “what humans would want” is not defined. “What humans say they want”, “what, upon reflection, humans would agree they want...”, etc. can be defined, but “what humans want” is not a defined thing about the world or about humans without extra assumptions (which cannot be deduced from observation).
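To make that concrete, here is a minimal sketch of the Occam’s razor argument being invoked (assuming it is the planner–reward decomposition result from “Occam’s razor is insufficient to infer the preferences of irrational agents”; the notation below is illustrative, not taken from the thread):

```latex
% Sketch (illustrative notation): behaviour alone does not determine preferences.
% The human's observable behaviour is a policy $\pi$; "what the human wants"
% would be a reward $R$ such that, for some planner $P$ (how the human turns
% preferences into actions),
\[
  \pi = P(R).
\]
% Observation only fixes $\pi$, and many planner-reward pairs reproduce it:
\[
  \pi \;=\; P(R) \;=\; (-P)(-R) \;=\; P_{\mathrm{indiff}}(\mathbf{0}),
\]
% where $-P$ is the anti-rational planner (it maximises $-R$) and
% $P_{\mathrm{indiff}}$ ignores its reward argument entirely. The degenerate
% pairs are roughly as simple as the intended one, so a simplicity prior over
% $(P, R)$ does not recover $R$ from behaviour alone; extra assumptions about
% the human are needed.
```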