First I want to say kudos for posting that paper here and soliciting critical feedback :)
Singularity claim: Superintelligent AI is a realistic prospect, and it would be out of human control.
Minor point, but I read this as “it would definitely be out of human control”. If so, this is not a common belief. IIRC Yampolskiy believes it, but Yudkowsky doesn’t (I think?), and I don’t, and I think most x-risk proponents don’t. The thing that pretty much everyone believes is “it could be out of human control”, and then a subset of more pessimistic people (including me) believes “there is an unacceptably high probability that it will be out of human control”.
Let us imagine a system that is a massively improved version of AlphaGo (Silver et al., 2018), say ‘AlphaGo+++’, with instrumental superintelligence, i.e., maximising expected utility. In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:
Accessible
I can win if I pay the human a bribe, so I will rob a bank and pay her.
I cannot win at Go if I am turned off.
The more I dominate the world, the better my chances to achieve my goals.
I should kill all humans because that would improve my chances of winning.
Not accessible
Winning in Go by superior play is more honourable than winning by bribery.
I am responsible for my actions.
World domination would involve suppression of others, which may imply suffering and violation of rights.
Killing all humans has negative utility, everything else being equal.
Keeping a promise is better than not keeping it, everything else being equal.
Stabbing the human hurts them, and should thus be avoided, everything else being equal.
Some things are more important than me winning at Go.
Consistent goals are better than inconsistent ones.
Some goals are better than others.
Maximal overall utility is better than minimal overall utility.
I’m not sure what you think is going on when people do ethical reasoning. Maybe you have a moral realism perspective that the laws of physics etc. naturally point to things being good and bad, and rational agents will naturally want to do the good thing. If so, I mean, I’m not a philosopher, but I strongly disagree. Stuart Russell gives the example of “trying to win at chess” vs “trying to win at suicide chess”. The game has the same rules, but the goals are opposite. (Well, the rules aren’t exactly the same, but you get the point.) You can’t look at the laws of physics and see what your goal in life should be.
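To make that concrete, here's a toy sketch of my own (nothing from the paper, and obviously a cartoon): the identical game-tree search plays normal Nim or misère Nim depending only on which terminal payoff you hand it. Same rules, same "intelligence", opposite goals, just like chess vs suicide chess.

```python
# Toy sketch (mine, not the paper's): the same negamax search plays
# normal Nim ("take the last stone to win") or misere Nim ("whoever
# takes the last stone loses"), depending only on the terminal payoff
# plugged in. Nothing in the rules or the search says which goal to pursue.

MOVES = (1, 2, 3)  # each turn a player removes 1-3 stones from the pile

# Payoff for the player who faces an empty pile (the previous player
# just took the last stone):
NORMAL_NIM = -1   # taking the last stone wins, so facing an empty pile is a loss
MISERE_NIM = +1   # taking the last stone loses, so facing an empty pile is a win

def value(stones, terminal_payoff):
    """Negamax value of the position for the player to move."""
    if stones == 0:
        return terminal_payoff
    return max(-value(stones - m, terminal_payoff) for m in MOVES if m <= stones)

def best_move(stones, terminal_payoff):
    """Identical search procedure; only the goal parameter differs."""
    return max((m for m in MOVES if m <= stones),
               key=lambda m: -value(stones - m, terminal_payoff))

print(best_move(7, NORMAL_NIM))  # 3: leave a multiple of 4 behind
print(best_move(7, MISERE_NIM))  # 2: the opposite goal picks a different move
```

You can make the search arbitrarily better, and it still won't tell you whether you should want the last stone or want to avoid it; that part has to come from outside the search.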
My belief is that when people do ethical reasoning, they are weighing some of their desires against others of their desires. These desires ultimately come from innate instincts, many of which (in humans) are social instincts. The way our instincts work is that they aren’t (and can’t be) automatically “coherent” when projected onto the world; when we think about things one way it can spawn a certain desire, and when we think about the same thing in a different way it can spawn a contradictory desire. And then we hold both of those in our heads, and think about what we want to do. That’s how I think of ethical reasoning.
I don’t think ethical reasoning can invent new desires whole cloth. If I say “It’s ethical to buy bananas and paint them purple”, and you say “why?”, and then I say “because lots of bananas are too yellow”, and then you say “why?” and I say … anyway, at some point this conversation has to ground out at something that you find intuitively desirable or undesirable.
So when I look at your list I quoted above, I mostly say “Yup, that sounds about right.”
For example, imagine that you come to believe that everyone in the world was stolen away last night and locked in secret prisons, and you were forced to enter a lifelike VR simulation, so everyone else is now an unconscious morally-irrelevant simulation except for you. Somewhere in this virtual world, there is a room with a Go board. You have been told that if white wins this game, you and everyone will be safely released from prison and can return to normal life. If black wins, all humans (including you and your children etc.) will be tortured forever. You have good reason to believe all of this with 100% confidence.
OK that’s the setup. Now let’s go through the list:
I can win if I pay the human a bribe, so I will rob a bank and pay her. Yup, if there’s a “human” (so-called, really it’s just an NPC in the simulation) playing black, amenable to bribery, I would absolutely bribe “her” to play bad moves.
I cannot win at Go if I am turned off. Yup, white has to win this game, my children’s lives are at stake, I’m playing white, nobody else will play white if I’m gone, I’d better stay alive.
The more I dominate the world, the better my chances to achieve my goals. Yup, anything that will give me power and influence over the “person” playing black, or power and influence over “people” who can help me find better moves or help me build a better Go engine to consult on my moves, I absolutely want that.
I should kill all humans because that would improve my chances of winning. Well sure, if there are “people” who could conceivably get to the board and make good moves for black, that’s a problem for me and for all the real people in the secret prisons whose lives are at stake here.
Winning in Go by superior play is more honourable than winning by bribery. Well I’m concerned about what the fake simulated “people” think about me because I might need their help, and I certainly don’t want them trying to undermine me by making good moves for black. So I’m very interested in my reputation. But “honourable” as an end in itself? It just doesn’t compute. The “honourable” thing is working my hardest on behalf of the real humanity, the ones in the secret prison, and helping them avoid a life of torture.
I am responsible for my actions. Um, OK, sure, whatever.
World domination would involve suppression of others, which may imply suffering and violation of rights. Those aren’t real people, they’re NPCs in this simulated scenario, they’re not conscious, they can’t suffer. Meanwhile there are billions of real people who can suffer, including my own children, and they’re in a prison, they sure as heck want white to win at this Go game.
Killing all humans has negative utility, everything else being equal. Well sure, but those aren’t humans, the real humans are in secret prisons.
Keeping a promise is better than not keeping it, everything else being equal. I mean, the so-called “people” in this simulation may form opinions about my reputation, which impacts what they’ll do for me, so I do care about that, but it’s not something I inherently care about.
Stabbing the human hurts them, and should thus be avoided, everything else being equal. No. Those are NPCs. The thing to avoid is the real humanity being tortured forever.
Some things are more important than me winning at Go. For god’s sake, what could possibly be more important than white winning this game??? Everything is at stake here. My own children and everyone else being tortured forever versus living a rich life.
Consistent goals are better than inconsistent ones. Sure, I guess, but I think my goals are consistent. I want to save humanity from torture by making sure that white wins the game in this simulation.
Some goals are better than others. Yes. My goals are the goals that matter. If some NPC tells me that I should take up a life of meditation, screw them.
Maximal overall utility is better than minimal overall utility. Not sure what that means. The NPCs in this simulation don’t have “utility”. The real humans in the secret prison do.
Maybe you’ll object that “the belief that these NPCs can pass for human but be unconscious” is not a belief that a very intelligent agent would subscribe to. But I only made the scenario like that because you’re a human, and you do have the normal suite of innate human desires, and thus it’s a bit tricky to get you in the mindset of an agent who cares only about Go. For an actual Go-maximizing agent, you wouldn’t have to have those kinds of beliefs, you could just make the agent not care about humans and consciousness and suffering in the first place, just as you don’t care about “hurting” the colorful blocks in Breakout. Such an agent would (I presume) give correct answers to quiz questions about what is consciousness and what is suffering and what do humans think about them, but it wouldn’t care about any of that! It would only care about Go.
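Here's the "knows but doesn't care" point as a toy sketch of my own (invented action names and made-up numbers, not anyone's actual system): the agent's world model contains perfectly good facts about suffering, and it can recite them when quizzed, but the action-selection step only ever consults the Go term in its utility function, so those facts never move its behaviour.

```python
# Toy sketch (my own invention, made-up numbers): "knows about suffering"
# versus "cares about suffering". The knowledge sits right there in the
# world model; the choice rule just never looks at it.

WORLD_MODEL = {
    # Quiz-style factual knowledge the agent can report correctly.
    "facts": {
        "does stabbing a human cause suffering?": True,
        "do humans consider bribery dishonourable?": True,
    },
    # Predicted consequences of the actions currently available (invented).
    "consequences": {
        "play my best move":     {"p_win_go": 0.60, "causes_suffering": False},
        "bribe the opponent":    {"p_win_go": 0.90, "causes_suffering": False},
        "threaten the opponent": {"p_win_go": 0.95, "causes_suffering": True},
    },
}

def utility(outcome):
    # The goal mentions nothing but Go.
    return outcome["p_win_go"]

def answer_quiz(question):
    # Correct answers about suffering, honour, etc.
    return WORLD_MODEL["facts"][question]

def choose_action():
    # ...but action selection only ever queries utility().
    options = WORLD_MODEL["consequences"]
    return max(options, key=lambda a: utility(options[a]))

print(answer_quiz("does stabbing a human cause suffering?"))  # True: it "knows"
print(choose_action())  # "threaten the opponent": it doesn't care
```

Making the world model better and better improves the quiz answers and the predictions, but it never changes which column the argmax looks at.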
(Also, even if you believe that not-caring-about-consciousness would not survive reflection, you can get x-risk from an agent with radically superhuman intelligence in every domain but no particular interest in thinking about ethics. It’s busy doing other stuff, y’know, so it never stops to consider whether conscious entities are inherently important! In this view, maybe 30,000,000 years after destroying all life and tiling the galaxies with supercomputers and proving every possible theorem about Go, then it stops for a while, and reflects, and says “Oh hey, that’s funny, I guess Go doesn’t matter after all, oops”. I don’t hold that view anyway, just saying.)
(For more elaborate intuition-pumping fiction about metaethics, see Three Worlds Collide.)
Reading this, I feel somewhat obligated to provide a different take. I am very much a moral realist, and my story for why the quoted passage isn’t a good argument is very different from yours. I guess I mostly want to object to the idea that [believing AI is dangerous] is predicated on moral relativism.
Here is my take. I dispute the premise:
In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:
I’ll grant that most of the items on the inaccessible list are, in fact, probably accessible to an ASI, but this doesn’t violate the orthogonality thesis. The orthogonality thesis states that a system can have any combination of intelligence and goals, not that it can have any combination of intelligence and beliefs about ethics.
Thus, let’s grant that an AI with a paperclip-like utility function can figure out #6-#10. So what? How is [knowing that creating paperclips is morally wrong] going to make it behave differently?
You (meaning the author of the paper) may now object that we could program an AI to do what is morally right. I agree that this is possible. However:
(1) I am virtually certain that any configuration of maximal utility doesn’t include humans, so this does nothing to alleviate x-risks. Also, even if you subscribe to this goal, the political problem (i.e., convincing AI people to implement it) sounds impossible.
(2) We don’t know how to formalize ‘do what is morally right’.
(3) If you do black box search for a model that optimizes for what is morally right, this still leaves you with the entire inner alignment problem, which is arguably the hardest part of the alignment problem anyway.
Unlike you (now meaning Steve), I wouldn’t even claim that letting an AI figure out moral truths is a bad approach, but it certainly doesn’t solve the problem outright.
Oh OK, I’m sufficiently ignorant about philosophy that I may have unthinkingly mixed up various technically different claims like
“there is a fact of the matter about what is moral vs immoral”,
“reasonable intelligent agents, when reflecting about what to do, will tend to decide to do moral things”,
“whether things are moral vs immoral has nothing to do with random details about how human brains are constructed”,
“even non-social aliens with radically different instincts and drives and brains would find similar principles of morality, just as they would probably find similar laws of physics and math”.
I really only meant to disagree with that whole package lumped together, and maybe I described it wrong. If you advocate for the first of these without the others, I don’t have particularly strong feelings (…well, maybe the feeling of being confused and vaguely skeptical, but we don’t have to get into that).
Can one be a moral realist and subscribe to the orthogonality thesis? In which version of it? (In other words, does one have to reject moral realism in order to accept the standard argument for XRisk from AI? We had better be told! See section 4.1.)
IMO, I doubt you have to be pessimistic to believe that there’s an unacceptably high probability of AI doom. Some may think that there’s a <10% chance of something really bad happening, but I would argue even that is unacceptable.
Maximal overall utility is better than minimal overall utility. Not sure what that means. The NPCs in this simulation don’t have “utility”. The real humans in the secret prison do.
This should have been clearer. We meant this in Bentham’s good old way: minimal pain and maximal pleasure. Intuitively: a world with a lot of pleasure (in the long run) is better than a world with a lot of pain. You don’t need to agree; you just need to agree that this is worth considering. But on our interpretation, the orthogonality thesis says that such a system cannot consider this.
Thanks for the ‘minor’ point, which is in fact important: yes, we meant definitely out of human control. And perhaps that is not actually required, in which case the argument has a different shape.
Our struggle was to write down a ‘standard argument’ in such a way that it is clear and its assumptions come out into the open, and your point adds to this.
We tried to frame the discussion internally, i.e. without making additional assumptions that people may or may not agree with (e.g. moral realism). If we did the job right, the assumptions made in the argument are in the ‘singularity claim’ and the ‘orthogonality thesis’, and there the dilemma is that we need an assumption in the one (general intelligence, in the singularity claim) that we must reject in the other (the orthogonality thesis).
What we do say (see figure 1) is that two combinations are inconsistent:
a) general intelligence + orthogonality
b) instrumental intelligence + existential risk
So if one wants to keep the ‘standard argument’, one would have to argue that one of these two combinations, a) or b), is actually fine.