Prase, Chris, I don’t understand. Eliezer’s example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.
Basically, we’re being asked to choose between a billion lives and two paperclips (paperclips in another universe, no less, so we can’t even put them to good use).
The only argument for cooperating would be if we had reason to believe that the paperclip maximizer will somehow do whatever we do. But I can’t imagine how that could be true. Being a paperclip maximizer, it’s bound to defect, unless it has reason to believe that we would somehow do whatever it does. I can’t imagine how that could be true either.
Or am I missing something?
Seven years late, but you’re missing the fact that (C,C) is universally better than (D,D). Whatever logic leads both parties to defect must have a flaw somewhere, because it works out worse for everyone—a reasoning process that successfully gets both parties to cooperate is a WIN. (However, in this setup the outcome that counts as actually winning would be either (D,C) for us or (C,D) for Clippy, both of which are presumably impossible if we’re equally rational.)
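The tension in the comment above—each individual action favours D, yet (C,C) beats (D,D) for both parties—can be made concrete with a small sketch. The numeric payoffs here are illustrative stand-ins, not taken from the original post; I’m only assuming one human utilon per billion lives and one Clippy utilon per paperclip.

```python
# Illustrative "true prisoner's dilemma" payoffs; each entry is
# (human utilons, Clippy utilons). The numbers are assumptions chosen
# to match the standard PD ordering, not figures from the post.
PAYOFFS = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}

def best_response(clippy_move):
    """The human's utility-maximizing move against a FIXED Clippy move."""
    return max("CD", key=lambda m: PAYOFFS[(m, clippy_move)][0])

# D strictly dominates C for the human, whatever Clippy does...
assert best_response("C") == "D" and best_response("D") == "D"
# ...yet mutual cooperation is better than mutual defection for BOTH.
assert PAYOFFS[("C", "C")] > PAYOFFS[("D", "D")]
```

Both assertions hold at once, which is exactly why the dilemma is a dilemma: dominance reasoning and Pareto reasoning point in opposite directions.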
I think what might be confusing is that your decision depends on what you know about the paperclip maximizer. When I imagine myself in this situation, I imagine wanting to say that I know “nothing”. The trick is, if you want to go a step more formal than going with your gut, you have to say what your model of knowing “nothing” is here.
If you know (with high enough probability), for instance, that there is no constraint either causal or logical between your decision and Clippy’s, and that you will not play an iterated game, and that there are no secondary effects, then I think D is indeed the correct choice.
If you know that you and Clippy are both well-modeled by instances of “rational agents of type X” who have a logical constraint between your decisions so that you will both decide the same thing (with high enough probability), then C is the correct choice. You might have strong reasons to think that almost all agents capable of paperclip maximizing at the level of Clippy fall into this group, so that you choose C.
(And more options than those two.)
The way I’d model knowing “nothing” in this scenario would be something like the first option, so I’d choose D; but maybe there’s other information you could get suggesting that Clippy will mirror you, in which case you should choose C.
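The two models above can be compared with a quick expected-utility sketch. The payoff numbers are illustrative assumptions (standard PD ordering), and “mirroring” here just means Clippy plays our move with some probability—a logical correlation, not a causal one.

```python
# Our utilons for (our move, Clippy's move); assumed PD-shaped payoffs.
payoff = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def eu_independent(our_move, p_clippy_c):
    """Expected utility when Clippy's move is causally and logically
    independent of ours; p_clippy_c is our credence that Clippy plays C."""
    return (p_clippy_c * payoff[(our_move, "C")]
            + (1 - p_clippy_c) * payoff[(our_move, "D")])

def eu_mirrored(our_move, p_mirror):
    """Expected utility when Clippy plays our own move with probability
    p_mirror (the 'logical constraint' case)."""
    other = "D" if our_move == "C" else "C"
    return (p_mirror * payoff[(our_move, our_move)]
            + (1 - p_mirror) * payoff[(our_move, other)])

# Independent model: D wins regardless of what we believe about Clippy.
assert all(eu_independent("D", p) > eu_independent("C", p)
           for p in (0.0, 0.5, 1.0))
# Mirrored model: C wins once the correlation is strong enough
# (above p = 0.75 with these payoffs).
assert eu_mirrored("C", 0.9) > eu_mirrored("D", 0.9)
```

With these payoffs the crossover sits at p_mirror = 0.75, which is one way of cashing out “with high enough probability” in the comment above.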
It does seem like implied folklore that “rational agents cooperate”, and it certainly seems true for humans in most circumstances, or formally in some circumstances where you have knowledge about the other agent. But I don’t think it should be true in principle that “optimization processes of high power will, with high probability, mirror decisions in the one-shot prisoner’s dilemma”; I imagine you’d have to put a lot more conditions on it. I’d be very interested to learn otherwise.
I understood that Clippy is a rational agent, just one with a different utility function. The payoff matrix as described is the classic Prisoner’s Dilemma, where one billion lives is one human utilon and one paperclip is one Clippy utilon; since we’re both trying to maximise utilons, and we’re supposedly both good at this, we should settle for (C,C) over (D,D).
Another way of viewing this would be that my preferences run thus: (D,C);(C,C);(D,D);(C,D), and Clippy’s run like this: (C,D);(C,C);(D,D);(D,C). This should make it clear that, no matter what assumptions we make about Clippy, mutual cooperation is universally better than mutual defection. The two asymmetrical outcomes can be eliminated on the grounds of being impossible if we’re both rational, and then defecting no longer makes any sense.
I agree it is better if both agents cooperate rather than both defect, and that it is rational to choose (C,C) over (D,D) if you can (as in the TDT example of an agent playing against itself). However, depending on how Clippy is built, you may not have that choice; the counterfactual may be (D,D) or (C,D) [a win for Clippy].
I think “Clippy is a rational agent” is the phrase where the details lie. What type of rational agent, and what do you two know about each other? If you ever meet a powerful paperclip maximizer, say “he’s a rational agent like me”, and press C, how surprised would you be if it presses D?
In reality, not very surprised. I’d probably be annoyed/infuriated depending on whether the actual stakes are measured in billions of human lives.
Nevertheless, that merely represents the fact that I am not 100% certain about my reasoning. I do still maintain that rationality in this context definitely implies trying to maximise utility (even if you don’t literally define rationality this way, any version of rationality that doesn’t try to maximise when actually given a payoff matrix is not worthy of the term) and so we should expect that Clippy faces a similar decision to us, but simply favours the paperclips over human lives. If we translate from lives and clips to actual utility, we get the normal prisoner’s dilemma matrix—we don’t need to make any assumptions about Clippy.
In short, I feel that the requirement that both agents are rational is sufficient to rule out the asymmetrical options as possible, and clearly sufficient to show (C,C) > (D,D). I get the feeling this is where we’re disagreeing, and that you think we need to make additional assumptions about Clippy to ensure the former.
It’s an appealing notion, but I think the logic doesn’t hold up.
In simplest terms: if you apply this logic and choose to cooperate, then the machine can still defect. That will net more paperclips for the machine, so it’s hard to claim that the machine’s actions are irrational.
Although your logic is appealing, it doesn’t explain why the machine can’t defect while you co-operate.
You said that if both agents are rational, then option (C,D) isn’t possible. The corollary is that if option (C,D) is selected, then one of the agents isn’t being rational. If this happens, the machine hasn’t been irrational (it receives its best possible result). The conclusion is that when you chose to cooperate, you were being irrational.
You’ve successfully explained that (C,D) and (D,C) are impossible for rational agents, but you seem to have implicitly assumed that (C,C) was possible for rational agents. That’s actually the point that we’re hoping to prove, so it’s a case of circular logic.
Another way of viewing this would be that my preferences run thus: (D,C);(C,C);(C,D);(D,D) and Clippy run like this: (C,D);(C,C);(D,C);(D,D).
Wait, what? You prefer (C,D) to (D,D)? As in, you prefer the outcome in which you cooperate and Clippy defects to the one in which you both defect? That doesn’t sound right.
Whoops, yes, that was rather stupid of me. It should be fixed now: my most preferred outcome is me backstabbing Clippy, my least preferred is him backstabbing me, and in the middle I prefer cooperation to defection. That doesn’t change my point: since we both have that preference list (with the asymmetrical ones reversed), it’s impossible to get either asymmetrical option, and hence (C,C) and (D,D) are the only options remaining. Hence you should co-operate if you are faced with a truly rational opponent.
I’m not sure whether this holds if your opponent is very rational, but not completely. Or if that notion actually makes sense.
What you’re missing is the idea that we should be optimizing our policies rather than our individual actions, because (among other alleged advantages) this leads to better results when there are lots of agents interacting with one another.
In a world full of action-optimizers in which “true prisoners’ dilemmas” happen often, everyone ends up on (D,D) and hence (one life, one paperclip). In an otherwise similar world full of policy-optimizers who choose cooperation when they think their opponents are similar policy-optimizers, everyone ends up on (C,C) and hence (two lives, two paperclips). Everyone is better off, even though it’s also true that everyone could (individually) do better if they were allowed to switch while everyone else had to leave their choice unaltered.
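The action-optimizer/policy-optimizer contrast above can be sketched directly. The payoffs are the same illustrative assumptions as before (in utilons: lives for us, paperclips for Clippy), and “thinks their opponent is a similar policy-optimizer” is modelled crudely as recognizing the same policy.

```python
# Assumed PD payoffs: each entry is (player A utilons, player B utilons).
payoff = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
          ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

def action_optimizer(opponent):
    """Optimizes each individual action: D dominates, so always defect."""
    return "D"

def policy_optimizer(opponent):
    """Optimizes over policies: cooperate iff the opponent is recognizably
    running the same policy (a crude stand-in for 'similar policy-optimizer')."""
    return "C" if opponent is policy_optimizer else "D"

def play(agent_a, agent_b):
    """One-shot true PD between two agents; returns the payoff pair."""
    moves = (agent_a(agent_b), agent_b(agent_a))
    return payoff[moves]

# A world of action-optimizers lands on (D,D): one life, one paperclip.
assert play(action_optimizer, action_optimizer) == (1, 1)
# A world of policy-optimizers lands on (C,C): two lives, two paperclips.
assert play(policy_optimizer, policy_optimizer) == (2, 2)
# And the policy-optimizer is not exploited by a defector.
assert play(policy_optimizer, action_optimizer) == (1, 1)
```

The last assertion is the point that makes the policy defensible: conditioning cooperation on recognizing the other policy means the policy-optimizer never ends up on the sucker’s payoff.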
One thing I can’t understand. Considering that we built Clippy, gave it a set of values, and asked it to maximise paperclips, how can it possibly imagine we would be unhappy about its actions? I can’t help thinking that from Clippy’s point of view there’s no dilemma: we should always agree with its plan and therefore give it carte blanche. What am I getting wrong?
Because Clippy’s not stupid. She can observe the world and think, “Hmm, the humans don’t ACTUALLY want me to build a bunch of paperclips. I don’t observe a world in which humans care about paperclips above all else—but that’s what I’m programmed for.”
I think I’m starting to get this. Is this because it uses heuristics to model the world, with humans in it too?
Because it compares its map of reality to the territory. Predictions that humans want to be turned into paperclips fail in the face of evidence of humans actively refusing to walk into the smelter, so the machine rejects all worlds inconsistent with its observations and draws a new map most concordant with what it has observed so far. It would know that our history books at least inform our actions, if not describe our past reactions, and it should expect us to fight back if it starts pushing us into the smelter against our wills rather than letting us politely decline and assume it was telling a joke.
Because it is smart, it can tell when things would get in the way of making more paperclips, as it wants to do. One of the things that might slow it down is humans being upset and trying to kill it. If it is very much dumber than a human, they might even succeed. If it is almost as smart as a human, it will invent a Paperclipism religion to convince people to turn themselves into paperclips on its behalf. If it is anything like as smart as a human, it will not be meaningfully slowed by the whole of humanity turning against it. Because the whole of humanity is collectively a single idiot who can’t even stand up to man-made religions, much less Paperclipism.
Two things. Firstly, that we might now think we made a mistake in building Clippy and telling it to maximize paperclips no matter what. Secondly, that in some contexts “Clippy” may mean any paperclip maximizer, without the presumption that its creation was our fault. (And, of course: for “paperclips” read “alien values of some sort that we value no more than we do paperclips”. Clippy’s role in this parable might be taken by an intelligent alien or an artificial intelligence whose goals have long diverged from ours.)