Oh, I see, so the argument is that conditional on the idealized synthesis algorithm being a good definition of human preferences, the AI can approximate the synthesis algorithm, and whatever utility function it comes up with and optimizes should not have any human-identifiable problems. That makes sense. Followup questions:
How do you tell the AI system to optimize for “what the idealized synthesis algorithm would do”?
How can we be confident that the idealized synthesis algorithm actually captures what we care about?
How do you tell the AI system to optimize for “what the idealized synthesis algorithm would do”?
Here’s a grounding (see section 1.2), here’s a definition of what we want, and here are valid ways of approximating it :-) At that point, it’s basically like approximating full Bayesian updating for an exceedingly complex system.
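(To make the analogy a bit more concrete: exact Bayesian updating over a huge hypothesis space is intractable, but standard approximations like importance sampling get you a usable posterior anyway. The sketch below is only a loose illustration of that kind of approximation, not anything specific to the synthesis algorithm; the prior, likelihood, and data in it are all made up for the example.)

```python
import numpy as np

# Loose illustration of "approximating full Bayesian updating":
# the exact posterior is intractable for a rich hypothesis space,
# so we approximate it with a reweighted finite sample.
# Everything here (the Gaussian prior, the likelihood, the data)
# is a hypothetical stand-in, not part of the research agenda.

rng = np.random.default_rng(0)

# Draw hypotheses from the prior (here just a scalar parameter theta).
n_samples = 10_000
theta_samples = rng.normal(loc=0.0, scale=1.0, size=n_samples)

def log_likelihood(theta, data):
    # Hypothetical likelihood: observations assumed ~ Normal(theta, 1).
    return -0.5 * np.sum((data - theta) ** 2)

data = np.array([0.8, 1.1, 0.9])  # made-up observations

# Importance weights: prior samples reweighted by likelihood
# approximate the posterior without computing it exactly.
log_w = np.array([log_likelihood(t, data) for t in theta_samples])
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Any posterior expectation can then be approximated by the weighted sample.
posterior_mean = np.sum(w * theta_samples)
print(f"approximate posterior mean: {posterior_mean:.3f}")
```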
How can we be confident that the idealized synthesis algorithm actually captures what we care about?
Much of the rest of the research agenda argues that this is the case; see especially section 2.8.
But for both these points, more research is still needed.