If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.
That’s why I distinguished between the hypotheses of “human utility” and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the “extrapolation” less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem. For the former group, human utility is more salient, while the latter probably cares more about the CEV hypothesis (and the arguments you list in favor of it).
Arguably, you can’t fully align with inconsistent preferences
My intuitions tend to agree, but I’m also inclined to ask “why not?” e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it “unaligned” with me? More generally, what is it about these other coherence conditions that prevent meaningful “alignment”? (Maybe it takes a big discursive can of worms, but I actually haven’t seen this discussed on a serious level so I’m quite happy to just read references).
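To make "absurdly cyclical" concrete, here is a minimal sketch, assuming preferences are given as strict pairwise comparisons. A cycle such as A over B, B over C, C over A cannot be represented by any real-valued utility function, since that would need u(A) > u(B) > u(C) > u(A). The function name `find_preference_cycle` and the option labels are hypothetical and only for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Tuple


def find_preference_cycle(
    strict_prefs: Iterable[Tuple[Hashable, Hashable]]
) -> Optional[List[Hashable]]:
    """Return one cycle of strict preferences, or None if there is none.

    Each pair (better, worse) means "better is strictly preferred to worse".
    A returned list like [A, B, C, A] means A > B > C > A.
    """
    # Build a directed graph with an edge better -> worse.
    graph: Dict[Hashable, List[Hashable]] = {}
    for better, worse in strict_prefs:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path: List[Hashable] = []

    def visit(node: Hashable) -> Optional[List[Hashable]]:
        # Depth-first search; meeting an IN_PROGRESS node closes a cycle.
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                found = visit(nxt)
                if found is not None:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            found = visit(node)
            if found is not None:
                return found
    return None


cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
acyclic = [("A", "B"), ("B", "C")]
cycle = find_preference_cycle(cyclic)
no_cycle = find_preference_cycle(acyclic)
```

This only shows that such preferences have no utility representation. It does not by itself settle whether a perfect imitator of the cyclical agent counts as "aligned", which is the question being asked.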
Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That’s still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.
Hadn’t thought about it this way. Partially updated (but still unsure what I think).
I didn’t reply to this originally, probably because I think it’s all pretty reasonable.
That’s why I distinguished between the hypotheses of “human utility” and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the “extrapolation” less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem.
My thinking on this is pretty open. In some sense, everything is extrapolation (you don’t exactly “currently” have preferences, because every process is expressed through time...). But OTOH there may be a strong argument for doing as little extrapolation as possible.
My intuitions tend to agree, but I’m also inclined to ask “why not?” e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information)
Well, imitating you is not quite right. (E.g., the now-classic example introduced with the CIRL framework: you want the AI to help you make coffee, not learn to drink coffee itself.) Of course, maybe it is imitating you at some level in its decision-making, such as imitating your way of judging what’s good.
under what sense of the word is it “unaligned” with me?
I’m thinking of things like: will it disobey requests which it understands and is capable of fulfilling? Will it fight you? Not to say that those things are universally wrong to do, but they could be types of alignment we’re shooting for, and inconsistencies do seem to create trouble there. Presumably if we know that it might fight us, we would want to have some kind of firm statement about what kind of “better” reasoning would make it do so (e.g., it might temporarily fight us if we were severely deluded in some way, but we want pretty high standards for that).
“Arguably, you can’t fully align with inconsistent preferences”
My intuitions tend to agree, but I’m also inclined to ask “why not?” e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it “unaligned” with me? More generally, what is it about these other coherence conditions that prevent meaningful “alignment”? (Maybe it takes a big discursive can of worms, but I actually haven’t seen this discussed on a serious level so I’m quite happy to just read references).
I’ve been thinking about whether you can have AGI that only aims for Pareto improvements, or a weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler’s thoughts on what he has called “pareto-topia”. (I haven’t gotten anywhere thinking about this because I’m spending my time on other things.)
Yeah, I think something like this is pretty important. Another reason is that humans inherently don’t like to be told, top-down, that X is the optimal solution. A utilitarian AI might redistribute property forcefully, whereas a Pareto-improving AI would seek to compensate people.
An even more stringent requirement which seems potentially sensible: only Pareto improvements which both parties understand and endorse. (I.e., there should be something like consent.) This seems very sensible with small numbers of people, but unfortunately seems infeasible for large numbers of people (given the way all actions have side-effects for many, many people).
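As a rough sketch of the two requirements, assuming each person's welfare can be summarized by a single number (a strong simplification, given that this thread is about preferences that may not be consistent): a change is a Pareto improvement if nobody ends up worse off and at least one person ends up better off, and the stricter version additionally asks for everyone's endorsement. The function names and the `endorses` flags are hypothetical.

```python
from typing import Sequence


def is_pareto_improvement(before: Sequence[float], after: Sequence[float]) -> bool:
    """True if no one is worse off and at least one person is better off."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same people")
    no_one_worse = all(a >= b for b, a in zip(before, after))
    someone_better = any(a > b for b, a in zip(before, after))
    return no_one_worse and someone_better


def is_endorsed_pareto_improvement(
    before: Sequence[float],
    after: Sequence[float],
    endorses: Sequence[bool],
) -> bool:
    """Stricter variant: also require that every affected person endorses it.

    `endorses[i]` stands in for "person i understands and endorses the change".
    """
    if len(endorses) != len(before):
        raise ValueError("need one endorsement flag per person")
    return is_pareto_improvement(before, after) and all(endorses)


# Forced redistribution: total welfare unchanged, but person 0 is worse off.
forced = is_pareto_improvement([10, 2], [6, 6])
# Compensated change: both people end up at least as well off.
compensated = is_pareto_improvement([10, 2], [11, 4])
# Same compensated change, but person 1 does not endorse it.
unendorsed = is_endorsed_pareto_improvement([10, 2], [11, 4], [True, False])
```

The infeasibility worry shows up directly in the sketch: `endorses` needs one entry per affected person, and with widespread side-effects that list becomes enormous.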
That all seems pretty fair.
See my other reply about pseudo-Pareto improvements, but I think the “understood + endorsed” idea is really important, and worth further thought.