It requires understanding human preferences in domains where humans are typically very uncertain, and where our answers to simple questions are often inconsistent, like how we should balance our own welfare with the welfare of others, or what kinds of activities we really want to pursue vs. enjoy in the moment.
The easy goal inference problem: Given no algorithmic limitations and access to the complete human policy — a lookup table of what a human would do after making any sequence of observations — find any reasonable representation of any reasonable approximation to what that human wants.
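To make the shape of this problem concrete, here is a minimal Python sketch. All of the names and types in it (Observation, HumanPolicy, infer_values, and so on) are illustrative assumptions rather than anything specified in the post, and the body of infer_values is deliberately left unimplemented, because finding it is the problem.

```python
# Illustrative sketch of the "easy" goal inference problem's shape.
# All names here (Observation, Action, infer_values, ...) are hypothetical;
# the post only describes the inputs and outputs informally.
from typing import Callable, Dict, Tuple

Observation = str          # stand-in for whatever the human perceives
Action = str               # stand-in for whatever the human can do
History = Tuple[Observation, ...]

# The complete human policy: a lookup table from every possible
# sequence of observations to the action the human would take next.
HumanPolicy = Dict[History, Action]

# A "reasonable representation of a reasonable approximation to what
# the human wants": here, a function that scores histories.
ValueFunction = Callable[[History], float]

def infer_values(policy: HumanPolicy) -> ValueFunction:
    """The easy goal inference problem: given unlimited compute and the
    full policy, recover something that approximates the human's values.
    The difficulty is that the human is not a perfect optimizer of any
    simple function, so this mapping is underdetermined."""
    raise NotImplementedError("This is exactly the open problem.")
```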
This seems similar to what moral philosophers do: They examine their moral intuitions, and the moral intuitions of other humans, and attempt to construct models that approximate those intuitions across a variety of scenarios.
(I think the difference between what moral philosophers do and the problem you’ve outlined is that moral philosophers typically work from explicitly stated human preferences, whereas the problem you’ve outlined involves implicitly inferring revealed preferences. I like explicitly stated human preferences better, for the same reason I’d rather program an FAI in an explicit, “non-magical” programming language like Python than in an implicit, “magical” one like Ruby.)
Coming up with a single moral theory that captures all our moral intuitions has proven difficult. The best approach may be a “parliament” that aggregates recommendations from a variety of moral theories. This parallels the idea of an ensemble in machine learning.
I don’t think it is necessary for the ensemble to know the correct answer 100% of the time. If some of the models in the ensemble think an action is immoral and others think it is moral, we can punt and ask the human overseer. Ideally, the system anticipates moral difficulties and asks us about them before they arise, so that it stays competitive when making time-sensitive decisions.
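To illustrate the parliament-as-ensemble idea and the punt-to-a-human fallback, here is a minimal Python sketch. Every name in it (MoralTheory, parliament_decides, ask_human_overseer, the 90% agreement threshold) is a hypothetical stand-in for illustration, not a proposal from the discussion above.

```python
# Minimal sketch of a "moral parliament" as an ensemble with deferral.
# Every name here (MoralTheory, ask_human_overseer, ...) is hypothetical;
# this only illustrates "punt to the human when the models disagree".
from typing import Callable, List

# Each moral theory votes on whether a proposed action is permissible.
MoralTheory = Callable[[str], bool]

def parliament_decides(action: str,
                       theories: List[MoralTheory],
                       agreement_threshold: float = 0.9) -> bool:
    """Return True if the action is judged permissible.

    If the theories agree strongly enough in either direction, act on
    the ensemble's verdict; otherwise defer to the human overseer.
    """
    votes = [theory(action) for theory in theories]
    approval = sum(votes) / len(votes)
    if approval >= agreement_threshold:
        return True
    if approval <= 1 - agreement_threshold:
        return False
    # The models disagree: punt to the human rather than guess.
    return ask_human_overseer(action)

def ask_human_overseer(action: str) -> bool:
    # Stand-in for the slow, expensive query to a human.
    answer = input(f"Is '{action}' morally acceptable? [y/n] ")
    return answer.strip().lower().startswith("y")
```

In a sketch like this, the call to ask_human_overseer is the slow step; the point about anticipating moral difficulties is that such queries would ideally be raised ahead of time, so a time-sensitive decision rarely has to block on one.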
From “The easy goal inference problem is still hard”