Some random thoughts on CEV:
1. To get the obvious disclaimer out of the way: I don’t actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we’re considering questions like this, that means we’ve reached superintelligence, and we’ll either trust the AIs to be better than us at thinking about these sorts of questions, or we’ll be screwed regardless of what we do.[1]
2. Regardless, imo the biggest question that standard CEV leaves unanswered is what starting population you extrapolate from. The obvious answer is “all the currently living humans,” but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn’t matter who implements it—see Eliezer’s discussion under “Avoid creating a motive for modern-day humans to fight over the initial dynamic.” I think this is a great principle, but imo it doesn’t go far enough. In particular:
(a) The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don’t want to incentivize any of that sort of hacking either.
(b) What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There’s a bunch of random chance here that imo shouldn’t be morally relevant.
(c) More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
3. So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.
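To make the shape of this bootstrapping proposal concrete, here is a minimal structural sketch in Python. Everything in it is a hypothetical placeholder (extrapolate, Agent, and Values stand in for wildly unsolved problems); the only point is the two-stage structure, reference class first and values second.

```python
# A structural sketch only, not an implementation: the extrapolation step is an
# assumed oracle, and the types are stand-ins.
from typing import Any, Callable, FrozenSet

Agent = str    # placeholder for some specification of a mind
Values = Any   # placeholder for an extrapolated value system

# Hypothetical primitive: extrapolate the volition of a population and pose it a question.
ExtrapolationOracle = Callable[[FrozenSet[Agent], str], Any]

def bootstrapped_cev(present_humans: FrozenSet[Agent],
                     extrapolate: ExtrapolationOracle) -> Values:
    # Stage 1: use only the people actually available now to settle the reference
    # class, i.e. which mathematical objects should count as "humans".
    reference_class: FrozenSet[Agent] = extrapolate(
        present_humans,
        "which agents belong in the extrapolation base?",
    )
    # Stage 2: run CEV proper over that (possibly much larger) reference class to
    # determine the values that actually get optimized for.
    return extrapolate(
        reference_class,
        "what values should be optimized for?",
    )
```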
[1] I’m generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though).
(Interesting. FWIW I’ve recently been thinking that it’s a mistake to think of this type of thing—“what to do after the acute risk period is safed”—as being a waste of time / irrelevant; it’s actually pretty important, specifically because you want people trying to advance AGI capabilities to have an alternative, actually-good vision of things. A hypothesis I have is that many of them are in a sense genuinely nihilistic/accelerationist; “we can’t imagine the world after AGI, so we can’t imagine it being good, so it cannot be good, so there is no such thing as a good future, so we cannot be attached to a good future, so we should accelerate because that’s just what is happening”.)
It seems more elegant (and perhaps less fraught) to have the reference class determination itself be a first-class part of the regular CEV process.
For example, start with a rough set of ~all alive humans above a certain development threshold at a particular future moment, and then let the set contract or expand according to their extrapolated volition. Perhaps the set or process they arrive at will be like the one you describe, perhaps not. But I suspect the answer to questions about how much to weight the preferences (or extrapolated CEVs) of distant ancestors and / or “edge cases” like the ones you describe in (b) and (c) wouldn’t be affected too much by the exact starting conditions either way.
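As a rough sketch of what “let the set contract or expand according to their extrapolated volition” could look like structurally, see below. extrapolate_membership is a purely hypothetical oracle, and whether anything like this process is well-defined or converges is of course the hard part.

```python
# A sketch under the assumption that "the base revises its own membership" can be
# treated as iteration toward a fixed point; nothing here claims that it can.
from typing import Callable, FrozenSet

Agent = str  # placeholder for some specification of a mind

# Hypothetical oracle: given a candidate extrapolation base, return the base that
# the current base's extrapolated volition says it should be.
MembershipOracle = Callable[[FrozenSet[Agent]], FrozenSet[Agent]]

def self_scoping_base(initial_base: FrozenSet[Agent],
                      extrapolate_membership: MembershipOracle,
                      max_rounds: int = 1000) -> FrozenSet[Agent]:
    base = initial_base
    for _ in range(max_rounds):
        revised = extrapolate_membership(base)
        if revised == base:  # fixed point: the base endorses its own membership
            return base
        base = revised
    raise RuntimeError("membership never stabilized; convergence is not guaranteed")
```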
Re: the point about hackability and tyranny, humans already have plenty of mundane / naturalistic reasons to seek power / influence / spread of their own particular current values, absent any consideration about manipulating a reference class for a future CEV. Pushing more of the CEV process into the actual CEV itself minimizes the amount of further incentive to do these things specifically for CEV reasons. Whereas, if a particular powerful person or faction doesn’t like your proposed lock-in procedure, they now have (more of) an incentive to take power beforehand to manipulate or change it.
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of “scoping the reference class” (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn’t have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I’m sure that there’ll be at least a few hundred other aspects (both more and less subtle than this one) that you and I obviously endorse, but that they will not implement, and that will drastically affect the outcome of the whole procedure.
In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly-arbitrary starting conditions (like “whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details”). And “drastically” means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one’s optimum is worth <10% to the other’s utility (or maybe, this holds for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy).
To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.
Yeah, I’ve written about that in §2.7.3 here.
I kinda want to say that there are many possible future outcomes that we should feel happy about. It’s true that many of those possible outcomes would judge others of those possible outcomes to be a huge missed opportunity, and that we’ll be picking from this set somewhat arbitrarily (if all goes well), but oh well, there’s just some irreducible arbitrariness in the nature of goodness itself.
I would go further, in that we will in practice be picking from this set of outcomes with a lot of arbitrariness, and that this is not removable.
Isn’t this what the “coherent” part is about? (I forget.)
re 2a: the set of all currently alive humans is already, uh, “hackable” via war and murder and so forth, and there are already incentives for evil people to do that. Hopefully the current offense-defense balance holds until CEV. If it doesn’t then we are probably extinct. That said, we could base CEV on the set of alive people as of some specific UTC timestamp (see the toy sketch below). That may be required, as the CEV algorithm may not ever converge if it has to recalculate as humans are continually born, mature, and die.
re 2b/c: if you are in the CEV set then your preferences about past and future people will be included in CEV. This should be sufficient to prevent radical injustice. This also addresses concerns with animals, fetuses, aliens, AIs, the environment, deities, spirits, etc. It may not be perfectly fair but I think we should be satisficing given the situation.
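To illustrate the snapshot idea from the 2a point above, here is a toy sketch. Person, extrapolate_step, and the timestamp are all hypothetical; the only point is that the base is frozen once at the snapshot rather than recomputed as the population changes.

```python
# Toy sketch: freeze the extrapolation base at a chosen UTC timestamp so that
# iterated extrapolation works over a stable set instead of a moving target.
from datetime import datetime, timezone
from typing import Callable, FrozenSet, NamedTuple, Optional

class Person(NamedTuple):
    name: str
    born: datetime
    died: Optional[datetime]  # None if still alive

def alive_at(population: FrozenSet[Person], t: datetime) -> FrozenSet[Person]:
    """People alive at snapshot time t: born on or before t and not yet dead."""
    return frozenset(
        p for p in population if p.born <= t and (p.died is None or p.died > t)
    )

def cev_over_snapshot(population: FrozenSet[Person],
                      snapshot: datetime,
                      extrapolate_step: Callable[[FrozenSet[Person], object], object],
                      rounds: int = 100) -> object:
    # The base is computed once from the snapshot and never recomputed, so the
    # iteration below is not chasing births and deaths that happen afterwards.
    base = alive_at(population, snapshot)
    state: object = None
    for _ in range(rounds):
        state = extrapolate_step(base, state)
    return state

# Hypothetical usage: pick an arbitrary cutoff.
SNAPSHOT = datetime(2030, 1, 1, tzinfo=timezone.utc)
```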
I’m generally of the opinion that CEV was always a bad goal and that we shouldn’t attempt it. A big reason for this is that I don’t believe a procedure exists that doesn’t incentivize humans to fight over the initial dynamic; put another way, who implements the CEV procedure will always matter. I don’t believe that humans will naturally converge, in the limit of more intelligence, to a fixed moral value system; instead I predict divergence as constraints are removed.
I roughly agree with Steven Byrnes, but stronger here (though I think this holds beyond humans too):
https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse
Is there really some particular human whose volition you’d like to coherently extrapolate over eternity but where you refrain because you’re worried it will generate infighting? Or is it more like, you can’t think of anybody you’d pick, so you want a decision procedure to pick for you?
If there is some particular human, who is it?
2b/2c. I think I would say that we should want a tyranny of the present to the extent that doing so is in our values upon reflection. If, for example, Rome still existed and took over the world, their CEV should depend on their ethics and population. I think it would still be a very good utopia, but it may also have things we dislike.
Other considerations, like nearby Everett branches… well they don’t exist in this branch? I would endorse game theoretical cooperation with them, but I’m skeptical of any more automatic cooperation than what we already have. That is, this sort of fairness is a part of our values, and CEV (if not adversarially hacked) should represent those already?

I don’t think this would end up in a tyranny anything like the usual form of the word if we’re actually implementing CEV. We have values for people being able to change and adjust over time, and so those are in the CEV.

There may very well be limits to how far we want humanity to change in general, but that’s perfectly allowed to be in our values. Like, as a specific example, some have said that they think global status games will be vastly important in the far future and thus a zero-sum resource. I find it decently likely that an AGI implementing CEV would discourage such, because humans wouldn’t endorse it on reflection, even if it is a plausible default outcome.
Like, essentially my view is: optimize our branch’s humanity’s values as hard as possible; those values contain desires for other people’s values to be satisfied, and thus those people are represented. Other forms of fairness to things we aren’t complete fans of can be bargained for (locally, or acausally between branches/whatever).
So that’s my argument against the tyranny and Everett branches part. I’m less skeptical of considering whether to include the recently dead, but I also don’t have a great theory of how to weight them. Those about to be born wouldn’t have a notable effect on CEV, I’d believe.
The option you suggest in #3 is nice, though I think it runs some risk of being dominated or notably influenced by “humans in other very odd branches”, such that we’re outweighed by them despite them not locally existing. I think it is less that you want a “human” predicate, and more a “human who has values compatible with this local branch” predicate. This is part of why I advocate just bargaining between branches: if the humans in an AGI-made New Rome want us to instantiate their constructed friendly/restricted AGI Gods locally to proselytize, they can trade for it rather than that faction being automatically divvied out a star by our AGI’s CEV.
“Human who has values compatible with this local branch” feels weak and arbitrary as a definition, but I’m not sure we can do better than that. I imagine we’d even have weightings, because we likely legitimately value babies in special ways that don’t entail maxing out their reward centers or boosting them to megaminds soon after birth; we have preferences about that. Then of course there are minds that are sort of humanish, which is why you’d have a weighting.
(This is kinda rambly, but I do think a lot of this can be avoided with just plain CEV because I think most people on reflection would end up with “reevaluate whether the deal was fair with reflection and then adjust the deal and reference class based on that”.)