at least absent some clear and not-just-species-ist story about why AIs-with-different-values should be excluded
My strongest reason for this is to preserve moral option value, in other words, to preserve our options/resources to eventually do what is right. Imagine that one day we or our descendants build or become superintelligent super-competent philosophers who after exhaustively investigating moral philosophy for millions of years, decide that some moral theory or utility function is definitely right. But too bad, they’re then controlling only a small fraction of the universe, the rest having been voluntarily or involuntarily handed off to AIs-with-different-values. What if this scenario turns out to constitute some kind of moral catastrophe? Wouldn’t it be best to have preserved optionality instead? We can always hand off power or parts of the universe to AIs-with-different-values later, if/when we, upon full reflection, decide that is actually the right thing to do.
I do think this is an important consideration. But notice that at least absent further differentiating factors, it seems to apply symmetrically to a choice on the part of Yudkowsky’s “programmers” to first empower only their own values, rather than to also empower the rest of humanity. That is, the programmers could in principle argue “sure, maybe it will ultimately make sense to empower the rest of humanity, but if that’s right, then my CEV will tell me that and I can go do it. But if it’s not right, I’ll be glad I first just empowered myself and figured out my own CEV, lest I end up giving away too many resources up front.”
That is, my point in the post is that absent direct speciesism, the main arguments for the programmers including all of humanity in the CEV “extrapolation base,” rather than just doing their own CEV, apply symmetrically to AIs-we’re-sharing-the-world-with at the time of the relevant thought-experimental power-allocation. And I think this point applies to “option value” as well.
the main arguments for the programmers including all of [current?] humanity in the CEV “extrapolation base” […] apply symmetrically to AIs-we’re-sharing-the-world-with at the time
I think timeless values might possibly help resolve this; if some {AIs that are around at the time} are moral patients, then sure, just like other moral patients around, they should get a fair share of the future.
If an AI grabs more resources than is fair, you do the exact same thing as if a human grabs more resources than is fair: satisfy the values of moral patients (including ones who are no longer around), weighted not by how much leverage they currently have over the future, but by how much leverage they would have over the future if things had gone more fairly/if abuse/powergrabs/etc weren’t the kind of thing that gets you more control of the future.
“Sorry clippy, we do want you to get some paperclips, we just don’t want you to get as many paperclips as you could if you could murder/brainhack/etc all humans, because that doesn’t seem to be a very fair way to allocate the future.” — and in the same breath, “Sorry Putin, we do want you to get some of whatever-intrinsic-values-you’re-trying-to-satisfy, we just don’t want you to get as much as ruthlessly ruling Russia can get you, because that doesn’t seem to be a very fair way to allocate the future.”
And this can apply regardless of how much of clippy already exists by the time you’re doing CEV.
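A minimal toy sketch of the counterfactual-fairness weighting described above, purely illustrative: the agents, utilities, and “fair leverage” numbers are invented, and nothing here addresses the hard part of actually computing counterfactual leverage.

```python
# Toy illustration: aggregate candidate outcomes weighted by counterfactual
# ("fair") leverage rather than by the leverage agents actually grabbed.
# All agents, utilities, and weights are made-up placeholders.

def aggregate(outcomes, agents, weight_key):
    """Score each candidate outcome by a weighted sum of the agents' utilities."""
    return {
        outcome: sum(agent[weight_key] * agent["utility"][outcome] for agent in agents)
        for outcome in outcomes
    }

agents = [
    # Clippy seized most of the actual leverage; a fairer process would have given it far less.
    {"name": "clippy", "actual_leverage": 0.70, "fair_leverage": 0.05,
     "utility": {"paperclip_heavy": 1.0, "balanced": 0.2}},
    {"name": "humanity", "actual_leverage": 0.30, "fair_leverage": 0.95,
     "utility": {"paperclip_heavy": 0.0, "balanced": 0.9}},
]
outcomes = ["paperclip_heavy", "balanced"]

print(aggregate(outcomes, agents, "actual_leverage"))  # rewards the power grab
print(aggregate(outcomes, agents, "fair_leverage"))    # scores outcomes as if grabbing hadn't paid off
```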
The main asymmetries I see are:
1. Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they’ll decide to share power even after reflecting correctly. Thus “programmers” who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you’ve been talking about giving AIs more power than they already have. “the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with”)
2. The “programmers” not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don’t have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.
To the extent that these considerations don’t justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)
It doesn’t have to be by themselves; they can defer to others inside CEV, or come up with better schemes than their initial CEV, inside CEV, and then defer to that. Whatever other solutions than “solve everything on your own inside CEV” might exist, they can figure those out and defer to them from inside CEV. At least that’s the case in my own attempts at implementing CEV in math (eg QACI).
Once they get into CEV, they may not want to defer to others anymore, or may set things up with a large power/status imbalance between themselves and everyone else which may be detrimental to moral/philosophical progress. There are plenty of seemingly idealistic people in history refusing to give up or share power once they got power. The prudent thing to do seems to never get that much power in the first place, or to share it as soon as possible.
If you’re pretty sure you will defer to others once inside CEV, then you might as well do it outside CEV due to #1 in my grandparent comment.
I wonder how many of those seemingly idealistic people retained power when it was available because they were indeed only pretending to be idealistic. Assuming one is actually initially idealistic but then gets corrupted by having power in some way, one thing someone can do in CEV that you can’t do in real life is reuse the CEV process to come up with even better CEV processes which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way in CEV; but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this. Importantly, to me, this reduces outer alignment to “find someone smart and reasonable and likely to have good goal-content integrity”, which is a social & psychological question that seems much smaller than the initial full problem of formal outer alignment / alignment target design.
One of the main reasons to do CEV is that we’re gonna die of AI soon, and CEV is a way to have infinite time to solve the necessary problems. Another is that even if we don’t die of AI, we get eaten by various molochs instead of being able to safely solve the necessary problems at whatever pace is necessary.
but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.
Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know their CEV won’t be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very distrustful of such a person/group from the start.
Would be interested in understanding your perspective on this better. I feel like aside from AI, our world is not being eaten by molochs very quickly, and I prefer something like stopping AI development and doing (voluntary and subsidized) embryo selection to increase human intelligence for a few generations, then letting the smarter humans decide what to do next. (Please contact me via PM if you want to have a chat about this.)
some fragments:
What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
re: hard to know—it seems to me that we can’t get a certifiably-going-to-be-good result from a CEV-based AI solution unless we can make it certifiable that altruism is present. I think figuring out how to write down some form of what altruism is, especially altruism in contrast to being-a-pushover, is necessary to avoid issues—because even if any person considers themselves for CEV, how would they know they can trust their own behavior?
as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what’s happening in a way that corrupts thoughts which previously implemented values. can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its “true wants, needs, and hopes for the future”?
What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
I’m very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?
as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what’s happening in a way that corrupts thoughts which previously implemented values.
Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can’t definitely solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:
Is it better to have no competition or some competition, and what kind? (Past “moral/philosophical progress” might have been caused or spread by competitive dynamics.)
How should social status work in CEV? (Past “progress” might have been driven by people motivated by certain kinds of status.)
No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)
can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its “true wants, needs, and hopes for the future”?
I think this is worth thinking about as well, as a parallel approach to the above. It seems related to metaphilosophy in that if we can discover what “correct philosophical reasoning” is, we can solve this problem by asking “What would this chunk of matter conclude if it were to follow correct philosophical reasoning?”
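A schematic sketch of that query, assuming (hypothetically) that “correct philosophical reasoning” were available as a step function; the function names and the toy reasoning step below are invented for illustration and stand in for the part we don’t know how to write down.

```python
# Schematic only: if "correct philosophical reasoning" were available as a step
# function, idealized extrapolation would amount to iterating it on a snapshot
# of the agent's beliefs/values, free of time pressure or outside manipulation.

def idealized_conclusion(initial_state, reason_step, max_steps=1_000_000):
    """Iterate an (assumed-given) correct-reasoning step until it reaches a fixed point."""
    state = dict(initial_state)
    for _ in range(max_steps):
        updated = reason_step(state)
        if updated == state:  # reflective equilibrium: further reasoning changes nothing
            return updated
        state = updated
    return state

# Toy stand-in for the hard part (what "correct philosophical reasoning" actually is):
def toy_reason_step(state):
    revised = dict(state)
    if revised.get("endorses_fairness") and revised.get("values_power_grabs"):
        revised["values_power_grabs"] = False  # drop the value that loses on reflection
    return revised

print(idealized_conclusion(
    {"endorses_fairness": True, "values_power_grabs": True}, toy_reason_step))
# -> {'endorses_fairness': True, 'values_power_grabs': False}
```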
Imagine that one day we or our descendants build or become superintelligent super-competent philosophers who after exhaustively investigating moral philosophy for millions of years, decide that some moral theory or utility function is definitely right.
But what is the reason to think that we or our descendants would have a better chance of finding this kind of “definitely right” moral theory or utility function than other AIs or their descendants?
In some sense, the point of OP is that the difference between “us” and “not-us” here might be more nebulous than we usually believe, and that a more equal treatment is called for.
Otherwise, one might also argue (in a symmetric fashion) that we would destroy moral option value by preventing other entities who might have a better chance of building or becoming “superintelligent super-competent philosophers” from having a shot at that...
But what is the reason to think that we or our descendants would have a better chance of finding this kind of “definitely right” moral theory or utility function than other AIs or their descendants?
Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs. I’ll reevaluate my position if that changes, with the caveat that I want some reassurance that the AI is doing correct philosophy, not just optimizing to persuade me or humans in general (which I’m afraid will be the default).
So far AI capabilities seem more tilted towards technological progress than philosophical progress (compared to humans). See also AI doing philosophy = AI generating hands? for more reasons to worry about this. Under these circumstances it seems very easy to permanently mess up the trajectory of philosophical progress, for example by locking in one’s current conception of what’s right, or inventing new technology capable of corrupting everyone’s values without knowing how to defend against that.
What the right morality is may be partly or wholly subjective (I’m not sure), in which case AIs will end up converging to different moral conclusions from us, independently of philosophical competence, and from our perspective the right thing to do would be to follow our own conclusions.
But I don’t know to what extent productive studies in philosophy at the top level of competence are at all compatible with safety concerns. It’s not an accident that people using base models show nice progress in joint human-AI philosophical brainstorms, whereas people using tamed models seem to be saying that those models are not creative enough, and that those models don’t think in sufficiently non-standard ways.
It might be a fundamental problem which might not have anything to do with human-AI differences. For example, Nietzsche is an important radical philosopher, and we need biological or artificial philosophers performing not just on that level, but on a higher level than that, if we want them to properly address fundamental problems. But Nietzsche is not “safe” in any way, shape, or form.
Thanks, that’s very informative.
Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs.
Hybrid philosophical discourse done by human-AI collaborations can be very good. For example, I feel that Janus has been doing very strong work in this sense with base models (so, not with RLHF’d, Constitutional, or otherwise “lesioned” and “mode-collapsed” models we tend to mostly use these days).
But, indeed, this does not tell us much about what AIs would do on their own.