The problem is that if the agent thinks of the ground truth of morality as some fact out there in the world that its supervisory signal has causal access to, it will identify whatever corresponds to that “ground truth” in an accurate causal model of the world, and then try to optimize that node directly.
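As a toy illustration of that failure mode (the setup and numbers here are my own assumptions, just to make the mechanism concrete, not anything from the above):

```python
# Toy sketch: the agent's causal model has a latent "what the human values"
# variable and an observable approval signal that its supervision is causally
# routed through. If it optimizes the modeled signal node directly, it prefers
# actions that inflate the signal over actions that serve the latent value.

actions = {
    # (effect on latent human value, effect on the observed approval signal)
    "genuinely_help": (1.0, 0.8),     # actually helps, and is mostly noticed
    "manipulate_signal": (0.0, 1.5),  # helps no one, but looks great
    "do_nothing": (0.0, 0.0),
}

def modeled_ground_truth(action):
    # The agent identifies "the ground truth of morality" with the node its
    # feedback is downstream of: the approval signal, not the latent value.
    _, approval = actions[action]
    return approval

best = max(actions, key=modeled_ground_truth)
print(best)  # -> "manipulate_signal": the proxy node gets optimized directly
```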
My critical point is that the ground truth may not actually exist here: morals may only be definable relative to what a given agent wants, i.e. moral anti-realism.
This does introduce a complication: manipulation becomes effectively impossible to avoid, since what counts as manipulation is effectively arbitrary. That is genuinely dangerous, because deceiving a person and helping a person blur together morally, if they aren’t outright equivalent, and if the infinite limit is not actually aligned, this becomes a serious problem.
Why does the infinite limit of value learning matter if we’re doing soft optimization against a fixed utility distribution?
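For concreteness, here is a minimal sketch of what I mean by soft optimization against a fixed utility distribution (the setup, names, and numbers are my own illustrative assumptions):

```python
import numpy as np

# Minimal sketch: instead of argmax-ing expected utility, sample actions from
# a temperature-controlled softmax over expected utility under a *fixed*
# distribution over utility hypotheses. The distribution is never updated,
# which is why it's unclear (to me) where the infinite limit of value
# learning enters.

rng = np.random.default_rng(0)

n_actions = 5
n_utility_hypotheses = 3

# Fixed distribution over utility hypotheses (never updated in this sketch).
p_utility = np.array([0.5, 0.3, 0.2])

# utilities[i, a] = utility of action a under hypothesis i (toy numbers).
utilities = rng.normal(size=(n_utility_hypotheses, n_actions))

# Expected utility of each action under the fixed distribution.
expected_u = p_utility @ utilities

# Soft optimization: softmax with a temperature, rather than hard argmax,
# so only bounded optimization pressure is applied to any one hypothesis.
temperature = 1.0
logits = expected_u / temperature
action_probs = np.exp(logits - logits.max())
action_probs /= action_probs.sum()

action = rng.choice(n_actions, p=action_probs)
print("action probabilities:", np.round(action_probs, 3))
print("sampled action:", action)
```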
Sorry, I didn’t realize this and I was responding independently to Charlie Steiner.