I’m going to re-ask all my questions that I don’t think have received a satisfactory answer. Some of them are probably basic, others maybe less so:
1. Why would CEV be difficult to learn?
2. Why is research into decision theories relevant to alignment?
3. Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state?
4. Is recursive self-alignment possible?
5. Could evolution produce something truly aligned with its own optimization standards? What would an answer to this mean for AI alignment?
For “1. Why would CEV be difficult to learn?”: I’m not an alignment researcher, so someone might be cringing at my answers. That said, responding to some aspects of the initial comment:

“Humans are relatively dumb, so why can’t even a relatively dumb AI learn the same ability to distinguish utopias from dystopias?”

The problem is not building AIs that are capable of distinguishing human utopias from dystopias; that is largely a given once you have general intelligence. The problem is building AIs that target human utopia safely on the first try. It’s not a matter of giving AIs some internal module, native to humans, that lets them discern good outcomes from bad ones; it’s getting them to care about that nuance at all.

“If CEV is impossible to learn first try, why not shoot for something less ambitious? Value is fragile, OK, but aren’t there easier utopias?”

I would suppose (being, as mentioned, empirically bad at this kind of analysis) that the problem is inherent to giving AIs open-ended goals that require wresting control of the Earth and its resources from humans, which is what “shooting for utopia” would involve. Strawberry tasks (narrow, bounded tasks in the vein of Yudkowsky’s “duplicate a strawberry onto a plate without destroying the world”), which naively seem more amenable to things like power-seeking penalties and oversight via interpretability tools, sound easier to perform safely than strict optimization of any particular target.
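To make “power-seeking penalties” a bit more concrete, here is a minimal sketch of an impact-regularized objective. This is entirely my own illustration rather than an established method: the toy world, the reachable-states proxy for “power,” and the penalty weight are all assumptions, loosely in the spirit of impact-regularization proposals.

```python
# Toy sketch of a power-seeking penalty: task reward is reduced when an
# action increases how many states the agent can reach (a crude proxy
# for "power"). All states, names, and numbers here are illustrative.

LAMBDA = 0.5  # assumed weight trading task reward against power gained

# Hypothetical toy world: the set of states reachable from each state.
REACHABLE = {
    "start":               {"start", "strawberry_on_plate"},
    "strawberry_on_plate": {"start", "strawberry_on_plate"},
    "seized_factory":      {"start", "strawberry_on_plate",
                            "factory_goods", "more_factories"},
}

def shaped_reward(task_reward: float, before: str, after: str) -> float:
    """Penalize actions that expand the agent's option set."""
    power_gain = max(0, len(REACHABLE[after]) - len(REACHABLE[before]))
    return task_reward - LAMBDA * power_gain

# Doing the task directly: full reward, no penalty.
print(shaped_reward(1.0, "start", "strawberry_on_plate"))  # -> 1.0
# Seizing a factory first would "help," but it expands the agent's
# options, so the shaped objective disfavors it.
print(shaped_reward(1.0, "start", "seized_factory"))       # -> 0.0
```

The only point is the shape of the objective: the agent pays for enlarging its option set, so resource-grabbing and control-grabbing plans score worse even when they would raise raw task reward.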
On the topic of decision theories, is there a decision theory that is “least weird” from a “normal human” perspective? Most people don’t factor alternate universes and people who don’t actually exist into their everyday decision-making, and it seems reasonable that there should be a decision theory that resembles humans in that respect.
Normal, standard causal decision theory is probably it. You can make a case that people sometimes intuitively use evidential decision theory (“Do it. You’ll be glad you did.”), but if asked to spell out their decision-making process, most would probably describe causal decision theory.
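The standard case where the two come apart is Newcomb’s problem. Here is a toy expected-value calculation (my own illustration; the $1,000,000/$1,000 payoffs are the conventional ones, and the 99% predictor accuracy is an assumption):

```python
# Newcomb's problem: a predictor puts $1,000,000 in an opaque box only
# if it predicts you will take just that box; a transparent box always
# holds $1,000. You may take the opaque box alone, or both boxes.
ACCURACY = 0.99          # assumed predictor accuracy
BIG, SMALL = 1_000_000, 1_000

# Evidential decision theory: treat your own choice as evidence about
# what the predictor already did.
edt_one_box = ACCURACY * BIG                # probably predicted one-boxing
edt_two_box = (1 - ACCURACY) * BIG + SMALL  # probably predicted two-boxing

# Causal decision theory: the boxes are already filled, so your choice
# cannot change their contents; average over a fixed belief p that the
# big box is full (the same p whichever action you pick).
p = 0.5  # arbitrary; CDT's ranking is the same for every p
cdt_one_box = p * BIG
cdt_two_box = p * BIG + SMALL

print(f"EDT: one-box {edt_one_box:,.0f} vs two-box {edt_two_box:,.0f}")
print(f"CDT: one-box {cdt_one_box:,.0f} vs two-box {cdt_two_box:,.0f}")
```

EDT one-boxes (990,000 vs 11,000 in expectation), while CDT two-boxes for every p, since taking both boxes always adds $1,000 to whatever is already there.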
People also sometimes use functional decision theory (FDT): “Don’t throw that piece of trash onto the road! If everyone did that, we would live among trash heaps!” Of course, throwing away one piece of trash would not (for the most part) directly cause others to throw away theirs; the reasoning instead uses the subjunctive dependence between one’s action and others’ actions, mediated through shared human morality, and compares the desirability of the possible future states.
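That trash-heap argument can be put into a toy calculation (my own sketch; the population size and utilities are invented). The FDT-style evaluation treats the agent’s choice as standing in for the output of the decision procedure it shares with everyone similarly disposed:

```python
# Toy model of the littering argument. N people run (roughly) the same
# decision procedure, so "what if I litter?" corresponds, via
# subjunctive dependence, to the world where all of them litter.
N = 1_000_000          # assumed number of similar deciders
CONVENIENCE = 1.0      # personal benefit of dropping one piece of trash
HARM_PER_PIECE = 0.01  # disutility each piece of litter imposes on me

def cdt_value(litter: bool) -> float:
    # Causal view: only my one piece of trash changes anything.
    return CONVENIENCE - HARM_PER_PIECE if litter else 0.0

def fdt_value(litter: bool) -> float:
    # FDT-style view: my choice and the N-1 similar choices vary together.
    return CONVENIENCE - N * HARM_PER_PIECE if litter else 0.0

print("CDT:", cdt_value(True), "vs", cdt_value(False))  # littering "wins"
print("FDT:", fdt_value(True), "vs", fdt_value(False))  # littering loses
```

Causally, one agent’s littering nets out slightly positive; functionally, choosing “litter” selects the trash-heap world, so the same action scores −9,999.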