What about Ethical AIs? They would be quite able (and, from a utilitarian point of view, completely moral!) to blackmail us, and it actually makes sense, so we should let ourselves be blackmailed, sort of.
Hm. This would get around (2) and (4), since rather than a tug of war to get created, it’s a many-dimensional tug of war to enforce values. But (1) and (3) are still in force. So, since the AI is supposed to be ethical, it’s probably safe to say not only that a rational agent shouldn’t change their actions, but also that ethical AIs with any aversion to doing horrible things won’t execute the horrible strategy to enforce their values.
Still, this doesn’t exclude the possibility of some more laid-back demerits/rewards system for past behavior implemented by a future AI, which gets around the instinctive part of (1) and all of (3) and is held back only by being a priori improbable.
Why, every minute FAI is delayed is a minute in which people suffer and die, creating X disutility. To “encourage” people to create FAI as fast as possible, an FAI might well assign up to X disutility to any person who fails to take an action that could speed up FAI development by one minute. This is not a priori improbable, nor is it unethical. In fact, blackmailing us is a logical thing to do. We can safely assume that a future FAI will want to be created as fast as possible and will not be reluctant to blackmail us for the good of all humans.
We can safely assume that a future FAI will want to be created as fast as possible and will not be reluctant to blackmail us for the good of all humans.
We can safely assume that a future FAI does not exist until it exists, and therefore cannot do anything to make itself come into being faster than it actually did. A presently nonexistent entity cannot make commitments about what it will do once it gets to exist, and missed opportunities which occurred before an FAI’s creation are sunk costs, so there would be no point in punishing them.
Hey, what if the future FAI punishes you for making half-baked arguments in the public domain, thereby panicking people, decreasing their rationality, and thereby decreasing the probability of FAI?
If I can model the future FAI with enough accuracy and if TDT turns out to be true, then I can indeed draw the conclusion that it will punish people who know about the importance of FAI but failed to act accordingly.
Also, my “half-baked arguments in the public domain” (which in fact reach only the very few people digging through a discussion post’s comments) don’t, I think, panic anyone (if they do, please tell me why); merely thinking about this should not be a reason to panic. It’s at least equally likely that people thinking about this conclude that the FAI will probably do this, and therefore do something to speed up FAI development (e.g. donate to SIAI).
The point of TDT is that you act as if you were deciding, not just on your own behalf, but on behalf of all agents sufficiently identical to you.
It has always seemed to me that the same decisions should be obtainable from ordinary decision theory, if you genuinely take into account the uncertainty about who and what you are. There are many possible worlds containing an agent whose experience is subjectively indistinguishable from yours; an idealized rationality, applied to an agent in your subjective situation, would actually assign some probability to each of those possibilities; and hence, the agents in all those worlds “should” make the same decision (but won’t, because they aren’t all ideally rational). There remains the question of whether the higher payoff that TDT obtains in certain extreme situations can also be derived from this more conventional style of reasoning, or whether it requires some additional heuristic. In this regard, one should remember that, if we are to judge the rationality of a decision theory by payoffs obtained (“rationalists should win”), whether a heuristic is best or second-best may depend on the context (e.g. on the prior).
So let’s consider the present context. It seems that the two agents that are supposed to coordinate, using TDT, in order to avoid a supposedly predictable punishment by a FAI in the future, are yourself now and yourself in the future. We could start by asking whether these two agents are really similar enough for TDT to even apply. To repeat my earlier observations: just because a situation exists in which a particular heuristic for action produces an effective coordination of actions across distances of space and time, and therefore a higher payoff, does not mean that the heuristic in question is generally rational, or that it is a form of timeless decision theory. To judge whether the heuristic is rational, as opposed to just being lucky, we would need to establish that it has some general applicability, and that its effectiveness can be deduced by the situated agent. To judge whether employing a particular counterintuitive heuristic amounts to employing TDT, we need to establish that its justification results from applying the principles of TDT, such as “identity, or sufficient similarity, of agents”.
In this case, I would first question whether you-now and you-in-the-future are even similar enough for the principles of TDT to apply. The epistemic situation of the two is completely different: you-in-the-future knows the Singularity has occurred and a FAI has come into being; you-now does not know that either of those things will happen.
I would also question the generality of the heuristic proposed here. Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
Perhaps the bottom line is, how likely is it that a FAI would engage in this kind of “timeless precommitment to punish”? Because people now do not know what sort of super-AI, if any, the future will actually bring, any such “postcommitments” made by such an AI, after it has come into existence, cannot rationally be expected to achieve any good, in the form of retroactive influence on the past, not least because of the uncertainty about the future AI’s value system! This mode of argument—“you should have done more, because you should have been scared of what I might do to you one day”—could be employed in the service of any value system. Why don’t you allow yourself to be acausally blackmailed by a future paperclip maximizer?
Okay, I get the feeling that I might be completely wrong about this whole thing. But prior to saying “oops”, I’d like my position completely crushed, so I don’t have any kind of loophole or a partial retreat that is still wrong. This means I’ll continue to defend this position.
First of all, I got TDT wrong when I read about it here on LW. Oops. It seems it is not applicable to the problem. Still, I feel my line of argument holds: if you know that a future FAI will take all actions necessary to bring about its faster creation, you can derive that it will also punish those who knew it would but didn’t make FAI happen faster.
Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
I’d call it friendly if it maximizes the expected utility of all humans, and if that involves blackmailing current humans who thought about this, so be it. Suppose the prior probability of a person doing X, where X makes FAI happen a minute faster and generates Y additional utility, is 0.25. If this person, after pondering the choices of an FAI (including punishing humans who didn’t speed up FAI development), becomes more likely to do X (say, 0.5 instead), then the FAI might punish that human for not doing X (and the human will anticipate this punishment) by up to 0.25 · Y utility, and the FAI is still friendly. If the AI, however, decides not to punish that human, then either the human’s model of the AI was incorrect, or the human correctly anticipated this behaviour, which would mean that the AI is not 100% friendly, since it could have created utility by punishing that human.
The argument that there are many different types of AGI, including ones that reward the very actions other AGIs punish, neglects that the probabilities of the different types of AI are spread unequally. I, personally, would assign a relatively high probability to FAI (higher than a null hypothesis would suggest), so that the expected utilities don’t cancel out. While we can’t be absolutely certain about the actions of a future AGI, we can estimate different probabilities for different mind designs. Bipping AIs might be more likely than Freepy AIs because so many people have donated to the fictional Institute on Bipping AI, whereas there is no such thing as a Freepy AI research center. I am uncertain about the value system of a future AGI, but not completely. A future paperclip maximizer is a mind design to which I would assign a low probability, and although the many different AGIs out there might together be more probable than FAI, each single one of them is unlikely compared to FAI, and thus I should work towards FAI.
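The arithmetic behind the 0.25 · Y figure can be sketched in a few lines of Python. This is only an illustration, not part of the original argument: the function names are hypothetical, utility is measured in units of Y, and the cost model (the punishment is imposed only on people who still do not do X, and a friendly AI counts that loss in full) is an added assumption.

```python
from fractions import Fraction


def threat_gain(p_before, p_after, value_of_x):
    """Expected extra utility from raising the chance that X gets done."""
    return (p_after - p_before) * value_of_x


def net_gain_for_ai(p_before, p_after, value_of_x, punishment):
    """Expected gain of the threat minus the expected cost of carrying it out.

    Assumption (hypothetical): the punishment is imposed only when the person
    does not do X (probability 1 - p_after), and a friendly AI counts the
    punished person's loss at face value.
    """
    expected_cost = (1 - p_after) * punishment
    return threat_gain(p_before, p_after, value_of_x) - expected_cost


Y = Fraction(1)  # measure utility in units of Y
p0, p1 = Fraction(1, 4), Fraction(1, 2)
print(threat_gain(p0, p1, Y))                      # 1/4
print(net_gain_for_ai(p0, p1, Y, Fraction(1, 4)))  # 1/8
```

Exact fractions are used so the numbers match the 0.25 and 0.5 in the text without rounding noise. Under this particular cost model, a punishment of 0.25 · Y would still leave the AI with a positive expected net gain.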
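The claim that unequal probabilities keep punishing and rewarding AI designs from cancelling out is just a weighted sum. The sketch below uses made-up probabilities and payoffs (they are not estimates from the thread) to show that the incentives cancel only when the weights match.

```python
from fractions import Fraction


def expected_incentive(designs):
    """Sum of probability * (utility change for NOT doing X) over AI designs.

    A negative result means that, in expectation, not doing X is penalized.
    """
    return sum(p * payoff for p, payoff in designs.values())


# Hypothetical numbers, chosen only for illustration.
symmetric = {
    "punishes non-helpers": (Fraction(1, 10), -1),
    "rewards non-helpers": (Fraction(1, 10), +1),
}
skewed = {
    "punishes non-helpers": (Fraction(3, 10), -1),
    "rewards non-helpers": (Fraction(1, 20), +1),
}
print(expected_incentive(symmetric))  # 0
print(expected_incentive(skewed))     # -1/4
```

The sketch says nothing about which weights are right; it only shows that the conclusion depends entirely on them.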
Where am I wrong? Where is this kind of argument flawed?
If you know that a future FAI will take all actions necessary that lead to its faster creation, you can derive that it will also punish those who knew it would, but didn’t make FAI happen faster.
But punishing them occurs after it has been created, and no action that it performs after it was created can cause it to have been created earlier than it was actually created. Therefore such post-singularity punishment is futile and a FAI would not perform it.
The only consideration in this scenario which can actually affect the time of an FAI’s creation is the pre-singularity fear of people who anticipated post-singularity punishment. But any actual future FAI is not itself responsible for this fear, and therefore not responsible for the consequences of that fear. Those consequences are entirely a product of ideas internal to the minds of pre-singularity people, such as ideas about the dispositions of post-singularity AIs.
Aside from the fact that I have already changed my mind and concluded that an FAI won’t punish, I’d still object: if we can anticipate an FAI which does not punish, we won’t feel obliged (or be tempted to feel obliged) to speed up its development. That means an AI would be better off foreseeably punishing people, and if the AI is friendly, it has a mind design which maximizes the utility functions of humans. If that involves having a mind design such that people anticipate punishment and thereby speed up its development, so be it. In particular, the fact that we know it’s a friendly AI makes it very easy for us to anticipate its actions, which the AI knows as well. This line of argument still holds; the chain breaks at a weaker link.
We are not rational. If our utility function were perfectly utilitarian, without any of the common biases (scope insensitivity, hyperbolic discounting, etc.), you’d be correct: the FAI wouldn’t blackmail us. But since we are not perfectly rational, an FAI might well conclude that blackmailing us will help, and it’ll punish those who didn’t devote their lives to FAI development although they had the intelligence required to contribute and knew about the problem of FAI creation. This punishment will not be 3^^^3 units of disutility, but it will still be enough that we wouldn’t want to suffer it.
This seems like one of those times where we can choose to be rational—the outcome depends only on your strategy about this one thing, not on your general rationality. So if you are fine with being blackmailed (or, on the lighter side, bribed) by a future FAI, you can do that, but it’s not necessary, given a fairly small amount of willpower.
I think we are currently experiencing an inferential gap: I feel that you don’t understand me, and I don’t understand you. What do you want to tell me? I don’t see how one could choose to “deny” that blackmail. Once you realize that a future FAI may punish you, I don’t see anything one could do, not even one of the points (1) to (4), other than accept the blackmail.
When there’s a high enough chance that someone won’t give in even if you use the horrible strategy, it’s not cost-effective to use the horrible strategy. Or, even if you model applying the horrible strategy to everyone who ever realizes it, it is made cost-ineffective by group rejection. So if you’re an AI choosing between winning strategies, and the horrible strategy turns out to be a net loss in your models, you won’t use it. Therefore, stand strong! :P
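The cost-effectiveness point can be sketched with a simple per-person expected value. The model and its parameter names are assumptions for illustration: a share of people give in and each yields some gain to the AI, and everyone else is punished at some cost to the AI. The break-even share follows from setting that expected value to zero.

```python
from fractions import Fraction


def net_value_of_threat(share_giving_in, gain_per_compliance, cost_per_punishment):
    """Expected value, per person, of committing to the 'horrible strategy'.

    Hypothetical model: a fraction `share_giving_in` of people give in, each
    yielding `gain_per_compliance`; everyone else is punished, costing the AI
    `cost_per_punishment`.
    """
    return (share_giving_in * gain_per_compliance
            - (1 - share_giving_in) * cost_per_punishment)


def break_even_share(gain_per_compliance, cost_per_punishment):
    """Share of people giving in at which the threat stops being a net loss.

    Solves f * g - (1 - f) * c = 0 for f, giving f = c / (g + c).
    """
    return cost_per_punishment / (gain_per_compliance + cost_per_punishment)


g, c = Fraction(1), Fraction(3)  # illustrative numbers only
print(break_even_share(g, c))                    # 3/4
print(net_value_of_threat(Fraction(1, 2), g, c))  # -1
```

In this toy model, “group rejection” means the actual share of people giving in falls below the break-even share, at which point the strategy is a net loss for the AI.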
I feel like this is something like burning a blackmail letter and then pretending to never have read it. If I know that the person in question has at some time understood the blackmail letter, but then deliberately burnt it and tried to forget about it, I will still impose the punishment.
Why should a “horrible strategy” (which might not even be so very horrible, being punished is probably still better than an FAI not existing) be a net loss? Even if you don’t understand the blackmailing or refuse to give in, FAI development is still a very important thing to do, and if you accept it and act accordingly, that’s beneficial to the FAI and therefore for all humans. So that’s 0 loss against a slight gain.
Also, I wouldn’t even want to refuse the blackmail, because accepting it will be beneficial to humanity (provided I act accordingly, that is).
I don’t know what you mean by “group rejection”, a search didn’t bring up any results.
The blackmail letter hasn’t been sent, though, when you think about it. Until the AI actually is created and chooses a strategy, bargaining doesn’t stop. But it’s true that you have to be willing to counterfactually burn blackmail letters if you want to not receive them in the first place.
It’s not 0 loss for the AI. Especially if it’s friendly, doing horrible things to people goes against its other goals. It would like it better if you were happy, and besides doing horrible things takes resources, so there is definitely a cost.
“Group rejection” just meant “nearly everyone doesn’t give in.”
And yeah, if you want to be blackmailed, go for it :P
The blackmail letter hasn’t been sent, though, when you think about it.
Again, I am not convinced by this argument. I’d argue that the blackmail letter has, in a way, been received, just like a normal blackmail letter, once you consider that an FAI might do this to make itself happen faster. If you realize that it might punish you in some way, provided this prospect had a chance of altering your actions, you have opened and read the letter. So the only way to resist the blackmail is not to think about it in the first place. Once you have thought about it, however, I really don’t see what one could do.
Imagine I have a pill which turns me into a mindless zombie (not the philosophical kind) for 3 hours. Taking this pill also leaves me with amnesia once I’m back to normal. I know that society punishes killing people. If I now deliberately take that pill (analogous to deliberately refusing the blackmail for no good reason other than “I don’t want to be blackmailed”), can I expect punishment?
This also has a real-world application: drinking is not illegal; committing crimes is. But in Germany, if one drinks oneself beyond a certain point (into a drunken stupor), one is not considered criminally responsible for one’s actions. However, one can be punished for drinking oneself to that point.
Likewise, if you deliberately try to forget about the blackmail once you have thought about it, the future FAI might consider the deliberate act of forgetting worthy of punishment.
It’s not 0 loss for the AI. Especially if it’s friendly, doing horrible things to people goes against its other goals. It would like it better if you were happy, and besides doing horrible things takes resources, so there is definitely a cost.
Suppose a punishment influences my actions such that, whereas beforehand I was not very likely to speed up FAI development by one minute by doing X (creating Y extra utility), after considering the blackmail I am much more likely to do X. How large a punishment may the FAI impose on me without becoming unfriendly? It’s greater than zero, because if the AI, by punishing me with Y-1 utility (or rather, by threatening to), gains an expected utility of Y that it would otherwise not gain, it will definitely threaten to punish me. Note that the things the FAI might do to someone are far from horrible: post-singularity life might just be a little less fun, but enough so that I’d prefer doing X.
If nearly no one gives in after thinking about it, then indeed the FAI will only punish those who were in some way influenced by the punishment, although “deliberately not giving in merely because one doesn’t want to be blackmailed” is kind of impossible (see above).
And yeah, if you want to be blackmailed, go for it :P
I have to assume that this (speeding up FAI development) is best in any case.
I’d argue that the blackmail letter has been received, in some way, analogous to a normal blackmail letter, if you think about that an FAI might do this to make itself happen faster
You are simply mistaken. The analogy to blackmail may be misleading you, so maybe try thinking about it without the analogy. You might also read up on the subject, for example by reading Eliezer’s TDT paper.
I’d like to see other opinions on this because I don’t see that we are proceeding any further.
I have now read the important parts of the TDT paper (more than just the abstract) and would say I understood at least those parts, though I don’t see anything in them that contradicts my considerations. I’m sorry, but I’m still not convinced. The analogies serve to make the problem easier to grasp intuitively, but I initially thought about this without such analogies. I still don’t see where my reasoning is flawed. Could you try a different approach?
Hm. Actually, consider the following game, where A is the AI and B is the human:
            A1          A2
Bx      +9, -1     +10, -1
By      -1, -10     +0, +0
(Rows are the human’s moves, columns the AI’s; each cell lists the AI’s payoff, then the human’s. A1 is the horrible strategy, A2 is not; Bx is giving in, By is not giving in.)
The Nash equilibrium of the game is A2,By—that is, not horrible and doesn’t give in.
But if we have two agents facing off that don’t make moves independently, but instead choose winning strategies, there are multiple equilibria. I should really read Strategy of Conflict. The initiative to choose a particular equilibrium, however, is ours for the taking, for obvious temporal reasons. If we choose one of the equilibrium strategies, we dictate the other equilibrium strategy to the AI.
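The equilibrium claim, and the point that whoever commits first gets to pick the outcome, can be checked by brute force on the payoffs above. This is a rough sketch, not the analysis from Strategy of Conflict: the two commitment scenarios (the human fixing a move in advance, or the AI fixing a conditional threat in advance) are added modeling assumptions, and the function names are hypothetical.

```python
from itertools import product

# Payoffs as (AI, human), read off the game above.
# A1 = horrible strategy, A2 = not horrible; Bx = give in, By = don't give in.
PAYOFF = {
    ("A1", "Bx"): (9, -1),
    ("A2", "Bx"): (10, -1),
    ("A1", "By"): (-1, -10),
    ("A2", "By"): (0, 0),
}
A_MOVES, B_MOVES = ("A1", "A2"), ("Bx", "By")


def pure_nash_equilibria():
    """Profiles where neither player gains by deviating alone."""
    result = []
    for a, b in product(A_MOVES, B_MOVES):
        ua, ub = PAYOFF[(a, b)]
        a_ok = all(PAYOFF[(alt, b)][0] <= ua for alt in A_MOVES)
        b_ok = all(PAYOFF[(a, alt)][1] <= ub for alt in B_MOVES)
        if a_ok and b_ok:
            result.append((a, b))
    return result


def ai_best_response(b):
    """AI's payoff-maximizing move against a fixed human move."""
    return max(A_MOVES, key=lambda a: PAYOFF[(a, b)][0])


def human_commits_first():
    """Human fixes a move in advance; the AI then best-responds."""
    b = max(B_MOVES, key=lambda move: PAYOFF[(ai_best_response(move), move)][1])
    return ai_best_response(b), b


def ai_commits_to_threat():
    """AI commits to: A1 (horrible) if the human does not give in, else A2."""
    policy = {"Bx": "A2", "By": "A1"}
    b = max(B_MOVES, key=lambda move: PAYOFF[(policy[move], move)][1])
    return policy[b], b


print(pure_nash_equilibria())  # [('A2', 'By')]
print(human_commits_first())   # ('A2', 'By')
print(ai_commits_to_threat())  # ('A2', 'Bx')
```

With these payoffs, a human who commits first ends up at (A2, By), whereas an AI able to commit to the conditional threat first gets the human to give in without ever executing the threat. That is the asymmetry the comment attributes to timing.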
You are probably correct: if it’s possible to plausibly precommit oneself to never, under any circumstances, being influenced by any type of blackmail, then, and only then, does it not make sense for the AI to threaten to punish people; that is, an AI which punishes non-helping persons who have precommitted to helping under no circumstances is then unlikely. The problem is that such a precommitment might be very hard to make: an AI will still assign a probability greater than zero to the possibility that I can be influenced by the blackmail, and as this probability approaches zero, the expected utility for the AI in the case where it does manage to convince me converges to Y, which means that the punishment I have to expect if I don’t help also converges to Y.
But wait! As the probability that I’m influenced by the AI shrinks, the probability that it imposes a punishment converging to Y without any good incentive grows, but since we are considering a friendly AI, this will also impose a negative expected utility converging to Y on the AI. This should mean that the expected punishment shrinks much faster as the probability of my actions being influenced by the AI, as rated by the AI, goes down. Thus, the more convincingly I state that I won’t be influenced by any kind of blackmail, the more rapidly the expected punishment will shrink, effectively becoming a minor inconvenience or less, since the AI will also assume that I’ll try to avoid punishment and thus drive the probability of my being influenced down even further.
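One simple way to formalize the correction in the last paragraph, offered as an assumption rather than as the model the commenter had in mind: let q be the AI’s estimate that the person is swayed. The threat then buys q · Y in expectation, while a friendly AI bears the full cost P of the punishment with probability 1 − q. Setting q · Y = (1 − q) · P gives the largest punishment worth committing to, P = q · Y / (1 − q), which falls towards zero as q does.

```python
from fractions import Fraction


def max_worthwhile_punishment(q, value_of_x):
    """Largest punishment a friendly AI would still find worth committing to.

    Hypothetical model: the threat buys q * value_of_x in expectation, and the
    AI bears the full cost of the punishment whenever the person is not swayed
    (probability 1 - q). Break-even: q * Y == (1 - q) * P.
    """
    if not 0 <= q < 1:
        raise ValueError("q must be in [0, 1)")
    return q * value_of_x / (1 - q)


for q in (Fraction(1, 2), Fraction(1, 10), Fraction(1, 100)):
    print(q, max_worthwhile_punishment(q, Fraction(1)))
# 1/2 1
# 1/10 1/9
# 1/100 1/99
```

The value Y is set to 1 here, so the printed bounds are in units of Y.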
I hope I’m correct at this one. If so, thank you, Manfred.
I hereby state that I will under no circumstances be influenced by any kind of acausal threatening from any kind of future Artificial Intelligence, which means that imposing any kind of punishment upon me is rendered completely useless.
The line of reasoning I followed before coming to this insight was flawed, and I sincerely apologize for wasting your precious computing power, probable future AI, as well as your time and effort, Manfred, Mitchell Porter, Multiheaded.
Attempt to summarize: a FAI would likely be too nice to bother with not-perfectly-effective torture, and I suspect that it’s a big ol’ softie, so I’m not feeling threatened much, making it even less effective, and so on in a positive feedback loop.
You can say: “Screw it, monsters from the future don’t dictate my actions, period”. This is expected to make any such pre-commitment to punish you pointless, as its threats no longer affect your behavior.
As someone mentioned, it’s like playing chicken against a remote-controlled car on a collision course with yours: you have everything to lose, while the opponent’s costs are much lower, but if you don’t EVER chicken out, it loses out slightly and gains nothing with such a strategy. Therefore, if it has a high opinion of your willpower, it’s not going to choose that strategy.
Well, if the FAI knows that you thought about this but then rejected it, deliberately trying to make that pre-commitment pointless, that’s not a reason not to punish you. It’s like burning a blackmail letter; if you read the blackmail letter and the blackmailer knows this, he will still punish you.
The chicken game is similar: if I knew that the opponent would punish me for not chickening out, and then deliberately changed myself so that I wouldn’t know this, the opponent would still punish me, because I deliberately chose not to chicken out when I altered myself.
Also, creating FAI is in my best interest, so I’d want to chicken out even if I knew the opponent would chicken out as well. The only case in which blackmailing is useless is if I always chicken out (=work towards FAI), or if it doesn’t influence my actions because I’m already so altruistic that I will push for FAI regardless of my personal gains/losses, but we are humans, after all, so it probably will.
What about Ethical AIs? They would quite be able (and completely moral, from a utilitarian point of view!) to blackmail us, and it actually makes sense so we should let ourselves be blackmailed, sort of.
Hm. This would get around (2) and (4), since rather than a tug of war to get created, it’s a many-dimensional tug of war to enforce values. But (1) and (3) are still in force. So since the AI is supposed to be ethical, it’s probably safe to say that not only shouldn’t a rational agent change their actions, ethical AIs that have anything against doing horrible things won’t execute the horrible strategy on values.
Though this still doesn’t exclude the possibility of some more laid-back demerits/rewards system for past behavior implemented by a future AI, which gets around the instinctive part of (1) and all of (3), and is just damped by the fact that it’s a priori improbable.
Why, every minute FAI is delayed is a minute where people suffer and die, creating X disutility. To “encourage” people creating FAI as fast as possible, a FAI might well assign up to X disutility to any person not taking an action which could speed up FAI development by one minute. This is not a priori improbable, it’s not unethical either. In fact, blackmailing us is a logical thing to do. We can safely assume that a future FAI will want to be created as fast as possible and will not be reluctant to blackmail us for the good of all humans.
We can safely assume that a future FAI does not exist until it exists, and therefore cannot do anything to make itself come into being faster than it actually did. A presently nonexistent entity cannot make commitments about what it will do once it gets to exist, and missed opportunities which occurred before an FAI’s creation are sunk costs and there would be no point in punishing them.
Hey, what if the future FAI punishes you for making half-baked arguments in the public domain, thereby panicking people, decreasing their rationality, and thereby decreasing the probability of FAI?
If I can model the future FAI with enough accuracy and if TDT turns out to be true, then I can indeed draw the conclusion that it will punish people who know about the importance of FAI but failed to act accordingly.
Also, by my “half baked arguments in the public domain” (which is, in fact, limited to those very few people digging through a discussion post’s comments), I don’t think I panick anyone (if I do, please tell me why), merely thinking about this should not be a reason to panick. It’s at least equally likely that people thinking about this come to the conclusion that the FAI will probably do this and therefore do something to speed up FAI development (e.g. donate to SIAI).
The point of TDT is that you act as if you were deciding, not just on your own behalf, but on behalf of all agents sufficiently identical to you.
It has always seemed to me that the same decisions should be obtainable from ordinary decision theory, if you genuinely take into account the uncertainty about who and what you are. There are many possible worlds containing an agent whose experience is subjectively indistinguishable from yours; an idealized rationality, applied to an agent in your subjective situation, would actually assign some probability to each of those possibilities; and hence, the agents in all those worlds “should” make the same decision (but won’t, because they aren’t all ideally rational). There remains the question of whether the higher payoff that TDT obtains in certain extreme situations can also be derived from this more conventional style of reasoning, or whether it requires some additional heuristic. In this regard, one should remember that, if we are to judge the rationality of a decision theory by payoffs obtained (“rationalists should win”), whether a heuristic is best or second-best may depend on the context (e.g. on the prior).
So let’s consider the present context. It seems that the two agents that are supposed to coordinate, using TDT, in order to avoid a supposedly predictable punishment by a FAI in the future, are yourself now and yourself in the future. We could start by asking whether these two agents are really similar enough for TDT to even apply. To repeat my earlier observations: just because a situation exists in which a particular heuristic for action produces an effective coordination of actions across distances of space and time, and therefore a higher payoff, does not mean that the heuristic in question is generally rational, or that it is a form of timeless decision theory. To judge whether the heuristic is rational, as opposed to just being lucky, we would need to establish that it has some general applicability, and that its effectiveness can be deduced by the situated agent. To judge whether employing a particular counterintuitive heuristic amounts to employing TDT, we need to establish that its justification results from applying the principles of TDT, such as “identity, or sufficient similarity, of agents”.
In this case, I would first question whether you-now and you-in-the-future are even similar enough for the principles of TDT to apply. The epistemic situation of the two is completely different: you-in-the-future knows the Singularity has occurred and a FAI has come into being, you-now does not know that either of those things will happen.
I would also question the generality of the heuristic proposed here. Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
Perhaps the bottom line is, how likely is it that a FAI would engage in this kind of “timeless precommitment to punish”? Because people now do not know what sort of super-AI, if any, the future will actually bring, any such “postcommitments” made by such an AI, after it has come into existence, cannot rationally be expected to achieve any good, in the form of retroactive influence on the past, not least because of the uncertainty about the future AI’s value system! This mode of argument—“you should have done more, because you should have been scared of what I might do to you one day”—could be employed in the service of any value system. Why don’t you allow yourself to be acausally blackmailed by a future paperclip maximizer?
Okay, I get the feeling that I might be completely wrong about this whole thing. But prior to saying “oops”, I’d like my position completely crushed, so I don’t have any kind of loophole or a partial retreat that is still wrong. This means I’ll continue to defend this position.
First of all, I got TDT wrong when I read about it here on lw. Oops. It seems like it is not applicible to the problem. Still I feel like my line of argument holds: If you know that a future FAI will take all actions necessary that lead to its faster creation, you can derive that it will also punish those who knew it would, but didn’t make FAI happen faster.
I’d call it friendly if it maximizes the expected utility of all humans, and if that involves blackmailing current humans who thought about this, so be it. Consider that the prior probability of a person doing X where X makes FAI happen a minute faster, generating Y additional utility, is 0.25. If this person, pondering the choices of an FAI, including punishing humans who didn’t speed up FAI development, is in the following more probable to do X (say, now 0.5), then the FAI might punish that human (and the human will anticipate this punishment) for up to 0.25 * Y utility for not doing X, and the FAI is still friendly. If the AI, however, decides not to punish that human, then either the human’s model of the AI was incorrect or the human correctly anticipated this behaviour, which would mean that the AI is not 100% friendly since it could have created utility by punishing that human.
The argument that there are many different types of AGI including those which reward those actions other AGIs punish neglects that the probabilities for different types of AI are spread unequally. I, personately, would assign a relatively high value to FAI (higher than a null hypothesis would suggest), so that the expected utilities don’t cancel out. While we can’t have absolute certainty about the actions of a future AGI, we can guess different probabilities for different mind designs. Bipping AIs might be more likely than Freepy AIs because so many people have donated to the fictional Institute on Bipping AI, whereas there is not even a thing such as a Freepy AI research center. I am uncertain about the value system of a future AGI, but not completely. A future paperclip maximizer is a mind design which I would assign a low probability to, and although the many different AGIs out there might together be more probable than FAI, every single one of them is unlikely compared to FAI, and thus, I should work towards FAI.
Where am I wrong? Where is this kind of argument flawed?
But punishing them occurs after it has been created, and no action that it performs after it was created can cause it to have been created earlier than it was actually created. Therefore such post-singularity punishment is futile and a FAI would not perform it.
The only consideration in this scenario which can actually affect the time of an FAI’s creation is the pre-singularity fear of people who anticipated post-singularity punishment. But any actual future FAI is not itself responsible for this fear, and therefore not responsible for the consequences of that fear. Those consequences are entirely a product of ideas internal to the minds of pre-singularity people, such as ideas about the dispositions of post-singularity AIs.
Aside from the fact that I have already changed my mind and concluded that an FAI won’t punish, I’d still object: if we can anticipate an FAI that does not punish, we wouldn’t feel obliged (or be tempted to feel obliged) to speed up its development. That means an AI would be better off being foreseeably willing to punish people, and if the AI is friendly, then it has a mind design that maximizes the utility functions of humans. If that involves having a mind design such that people anticipate punishment and thereby speed up its development, so be it. In particular, knowing that it’s a friendly AI makes it very easy for us to anticipate its actions, and the AI knows this as well. This line of argument still holds; the chain breaks at a weaker link.
Even if the blackmailer is a nice guy, present rational agents still, from their perspectives, shouldn’t change their decisions.
We are not rational. If our utility function were perfectly utilitarian, without any of the common biases (scope insensitivity, hyperbolic discounting, etc.), you’d be correct: the FAI wouldn’t blackmail us. But since we are not perfectly rational, an FAI might well conclude that blackmailing us will help, and it will punish those who didn’t devote their lives to FAI development although they had the intelligence required to contribute and knew about the problem of FAI creation. This punishment will not be 3^^^3 units of disutility, but it will still be large enough that we wouldn’t want to suffer it.
This seems like one of those times when we can choose to be rational: the outcome depends only on your strategy about this one thing, not on your general rationality. So if you are fine with being blackmailed (or, on the lighter side, bribed) by a future FAI, you can accept that, but it’s not necessary, given a fairly small amount of willpower.
I think we have an inferential gap here: I feel that you don’t understand me, and I don’t understand you. What are you trying to tell me? I don’t see how one could choose to “deny” that blackmail. Once you realize that a future FAI may punish you, I don’t see anything one could do, not even one of the points (1) to (4), other than accept the blackmail.
When there’s a high enough chance that someone won’t give in even if you use the horrible strategy, the horrible strategy isn’t cost-effective. Or, even if you model applying the horrible strategy to everyone who ever realizes this, it is made cost-ineffective by group rejection. So if you’re an AI choosing between winning strategies, and the horrible strategy turns out to be a net loss in your models, you won’t use it. Therefore, stand strong! :P
I feel like this is something like burning a blackmail letter and then pretending to never have read it. If I know that the person in question has at some time understood the blackmail letter, but then deliberately burnt it and tried to forget about it, I will still impose the punishment.
Why should a “horrible strategy” (which might not even be so very horrible; being punished is probably still better than an FAI not existing) be a net loss? Even if you don’t understand the blackmail or refuse to give in, FAI development is still a very important thing to do, and if you accept it and act accordingly, that’s beneficial to the FAI and therefore to all humans. So that’s zero loss against a slight gain.
Also, I wouldn’t even want to refuse the blackmail, because accepting it will be beneficial to humanity (provided I act accordingly, that is).
I don’t know what you mean by “group rejection”, a search didn’t bring up any results.
The blackmail letter hasn’t been sent, though, when you think about it. Until the AI actually is created and chooses a strategy, bargaining doesn’t stop. But it’s true that you have to be willing to counterfactually burn blackmail letters if you want to not receive them in the first place.
It’s not 0 loss for the AI. Especially if it’s friendly, doing horrible things to people goes against its other goals. It would like it better if you were happy, and besides doing horrible things takes resources, so there is definitely a cost.
“Group rejection” just meant “nearly everyone doesn’t give in.”
And yeah, if you want to be blackmailed, go for it :P
Again, I am not convinced by this argument. I’d argue that the blackmail letter has, in a sense, already been received, just like a normal blackmail letter, once you consider that an FAI might do this to make itself happen faster. If you realize that it might punish you in some way whenever this prospect had a chance of altering your actions, you have opened and read the letter. So the only way to resist the blackmail is not to think about it in the first place. Once you have thought about it, however, I really don’t see what one could do.
Imagine I have a pill that turns me into a mindless zombie (not the philosophical kind) for 3 hours. Taking this pill also leaves me with amnesia once I’m back to normal. I know that society punishes killing people. If I now deliberately take that pill (analogous to deliberately refusing the blackmail without any good reason other than “I don’t want to be blackmailed”) and kill someone while under its effect, can I expect punishment?
This has a real-world counterpart: drinking is not illegal; committing crimes is. But in Germany, if one drinks oneself into a drunken stupor beyond a certain point, one is not considered criminally responsible for one’s actions. However, one can be punished for drinking oneself to that point.
Likewise, if you deliberately try to forget about that blackmail once you have thought about it, the future FAI might consider the deliberate act of forgetting worthy of punishment.
Suppose a punishment influences my actions such that, whereas beforehand I was not very likely to speed up FAI development by 1 minute by doing X (creating Y extra utility), after considering the blackmail I am much more likely to do X. How large a punishment may the FAI impose on me without becoming unfriendly? It’s greater than zero: if the AI, by punishing me with Y-1 utility (or rather, by threatening to), gains an expected utility of Y that it would otherwise not gain, it will definitely threaten to punish me. Note that the things the FAI might do to someone are far from horrible; post-singularity life might just be a little less fun, but enough so that I’d prefer doing X.
If nearly everyone doesn’t give in after thinking about it, then the FAI will indeed only punish those who were in some way influenced by the punishment, although “deliberately not giving in merely because one doesn’t want to be blackmailed” is arguably impossible; see above.
You are simply mistaken. The analogy to blackmail may be misleading you; maybe try thinking about it without the analogy. You might also read up on the subject, for example by reading Eliezer’s TDT paper.
I’d like to see other opinions on this because I don’t see that we are proceeding any further.
I have now read important parts of the TDT paper (more than just the abstract) and would say I understood at least those parts, though I don’t see anything in them that contradicts my considerations. I’m sorry, but I’m still not convinced. The analogies serve to make the problem more graspable to intuition, but I initially thought about this without such analogies. I still don’t see where my reasoning is flawed. Could you try different approaches?
Hm. Actually, if you think about the following game, where A is the AI and B is the human:
      A1         A2
Bx    +9, -1     +10, -1
By    -1, -10    +0, +0
The Nash equilibrium of the game is (A2, By): the AI doesn’t use the horrible strategy and the human doesn’t give in.
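A quick check of that equilibrium claim, as a sketch: the payoffs are read off the table with the first number taken as the AI’s (A) and the second as the human’s (B), which is the reading under which (A2, By) comes out as the equilibrium. A1 is the horrible strategy, A2 is not; Bx is giving in, By is not. This only searches pure strategies in a simultaneous-move game and does not model the precommitment dynamics discussed next.

```python
from itertools import product

# (AI payoff, human payoff) for each (AI move, human move), from the table.
PAYOFFS = {
    ("A1", "Bx"): (9, -1),
    ("A2", "Bx"): (10, -1),
    ("A1", "By"): (-1, -10),
    ("A2", "By"): (0, 0),
}
AI_MOVES = ("A1", "A2")
HUMAN_MOVES = ("Bx", "By")


def pure_nash_equilibria(payoffs):
    """Return the (ai, human) profiles where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(AI_MOVES, HUMAN_MOVES):
        ai_payoff, human_payoff = payoffs[(a, b)]
        ai_best = all(payoffs[(alt, b)][0] <= ai_payoff for alt in AI_MOVES)
        human_best = all(payoffs[(a, alt)][1] <= human_payoff for alt in HUMAN_MOVES)
        if ai_best and human_best:
            equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria(PAYOFFS))  # [('A2', 'By')]
```

Note that A2 strictly beats A1 whatever the human does (10 > 9 and 0 > -1), so in the simultaneous game the threat is never a best response.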
But if we have two agents facing off that don’t make moves independently, but instead choose winning strategies, there are multiple equilibria. I should really read Strategy of Conflict. The initiative to choose a particular equilibrium, however, is ours for the taking, for obvious temporal reasons. If we choose one of the equilibrium strategies, we dictate the other equilibrium strategy to the AI.
You are probably correct: if, and only if, it’s possible to plausibly precommit to never being influenced by any kind of blackmail, it stops making sense for the AI to threaten to punish people; that is, an AI that punishes non-helpers who precommitted to never helping in response to threats is then unlikely. The problem is that such a precommitment might be very hard: an AI will still assign a probability greater than zero to the possibility that I can be influenced by the blackmail, and the closer this probability gets to zero, the closer the expected utility the AI gains if it does manage to convince me gets to Y, which means that the punishment I should expect if I don’t help also converges to Y.
But wait! As the probability that I’m influenced by the AI shrinks, the probability that it imposes a punishment converging to Y without any good incentive grows, and since we are considering a friendly AI, this will also impose a negative expected utility converging to Y on the AI. This means the expected punishment should shrink much faster as the probability, as estimated by the AI, that my actions are influenced goes down. Thus, the more convincingly I state that I won’t be influenced by any kind of blackmail, the more rapidly the expected punishment shrinks, effectively becoming a minor inconvenience or less, since the AI will also assume that I’ll try to avoid punishment and thus push the probability of my being influenced down even further.
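A toy model of this correction, under simplifying assumptions that are not in the thread: with probability p the threat works and the AI gains Y without having to punish anyone; otherwise a friendly AI that follows through counts the punishment P as its own loss. The threat only pays off while p * Y >= (1 - p) * P, so the largest worthwhile punishment is p * Y / (1 - p), which goes to zero as p does. The function name is hypothetical.

```python
def max_worthwhile_punishment(p_influenced: float, y: float) -> float:
    """Largest punishment P for which threatening still pays off for a friendly AI.

    Toy model (an assumption, not from the thread): with probability
    p_influenced the threat works and the AI gains y without punishing anyone;
    otherwise it follows through and counts the punishment P as its own loss.
    Break-even: p * y == (1 - p) * P, so P = p * y / (1 - p).
    """
    if not 0.0 <= p_influenced < 1.0:
        raise ValueError("p_influenced must be in [0, 1)")
    return p_influenced * y / (1.0 - p_influenced)


# The more convincing the refusal (smaller p), the smaller the worthwhile threat.
for p in (0.5, 0.1, 0.01, 0.0):
    print(p, max_worthwhile_punishment(p, y=1.0))
```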
I hope I’m correct about this one. If so, thank you, Manfred.
I hereby state that I will under no circumstances be influenced by any kind of acausal threatening from any kind of future Artificial Intelligence, which means that imposing any kind of punishment upon me is rendered completely useless.
The line of reasoning I followed before coming to this insight was flawed, and I sincerely apologize for wasting your precious computing power, probable future AI, and your time and effort, Manfred, Mitchell Porter, Multiheaded.
Do I win? I hope I do.
Attempt to summarize: a FAI would likely be too nice to bother with not-perfectly-effective torture, and I suspect that it’s a big ol’ softie, so I’m not feeling threatened much, making it even less effective, and so on in a positive feedback loop.
You can say: “Screw it, monsters from the future don’t dictate my actions, period”. This is expected to make any such pre-commitment to punish you pointless, as its threats no longer affect your behavior.
As someone mentioned, it’s like playing chicken against a remote-controlled car on a collision course with yours; you have everything to lose while the opponent’s costs are much lower, but if you NEVER chicken out, it loses out slightly and gains nothing with such a strategy. Therefore, if it has a high opinion of your willpower, it’s not going to choose that strategy.
Well, if the FAI knows that you thought about this but then rejected it, deliberately trying to make that pre-commitment pointless, that’s not a reason not to punish you. It’s like burning a blackmail letter; if you read the blackmail letter and the blackmailer knows this, he will still punish you.
In that chicken game it’s similar: if I knew that the opponent would punish me for not chickening out and then deliberately changed myself so that I wouldn’t know this, the opponent would still punish me, because I deliberately chose not to chicken out when I altered myself.
Also, creating FAI is in my best interest, so I’d want to chicken out even if I knew the opponent would chicken out as well. Blackmail is useless only if I always chicken out (i.e., work towards FAI) anyway, or if it doesn’t influence my actions because I’m already so altruistic that I will push for FAI regardless of my personal gains or losses; but we are humans, after all, so it probably will influence us.