For “1. Why would CEV be difficult to learn?”: I’m not an alignment researcher, so someone might be cringing at my answers. That said, responding to some aspects of the initial comment:
Humans are relatively dumb, so why can’t even a relatively dumb AI learn the same ability to distinguish utopias from dystopias?
The problem is not building AIs that are capable of distinguishing human utopias from dystopias—that capability comes largely for free with general intelligence. The problem is building AIs that target human utopia safely on the first try. It's not a matter of giving AIs some internal module, native to humans, that lets them discern good outcomes from bad ones; it's a matter of getting them to care about that distinction at all.
if CEV is impossible to learn first try, why not shoot for something less ambitious? Value is fragile, OK, but aren’t there easier utopias?
I would suppose (being, as mentioned, empirically bad at this kind of analysis) that the problem is inherent to giving AIs open-ended goals that require wresting control of the Earth and its resources from humans, which is what "shooting for utopia" would involve. Strawberry tasks, which naively seem more amenable to things like power-seeking penalties and oversight via interpretability tools, sound easier to perform safely than strict optimization of any particular target.