Wouldn’t the blackmailer reason along the lines of “If I let my choice of whether to blackmail be predicated on whether or not the victim would take my blackmailing into account, wouldn’t that just give them a motive to predict that and self-modify so as not to be influenced by it?” Then, by the corresponding reasoning, the potential blackmail victims might reason “I have nothing to gain by ignoring it.”
I’m a bit confused on this matter.
Well, sure, if the blackmail victim were silly enough to reason “I have nothing to gain by ignoring it” once the blackmailer went through anyway, then the blackmailer would indeed decide to ignore their decision to ignore it and go through anyway. But that’s only if the blackmail victim is that silly.
In a problem like this, the “do nothing” side has the advantage; there’s nothing the other side can do to make them responsive and blackmailable. That’s why I expect TDT to resolve to a blackmail-free equilibrium.
I was thinking along the lines of the blackmailer using the same reasoning to decide that, whether or not the potential victim would be a blackmail-ignorer, the blackmailer would still blackmail regardless.
I.e., the Blackmailer, by reasoning similar to the potential Victim’s, decides to make sure the victim has nothing to gain by choosing to ignore, by precommitting to blackmail whether or not the victim would ignore it. In this sense the blackmailer is also taking a “do nothing” stance, in the sense that there’s nothing the victim can do to stop them from blackmailing.
This sort of thing would seem to lead to an equilibrium of lots of blackmailers blackmailing victims that will ignore them. Which is, of course, a pathological outcome, and any sane decision theory should reject it. No blackmail seems like the “right” equilibrium, but it’s not obvious to me exactly how TDT would get there.
Only if you expect that the blackmail victim has not “already” decided that if the blackmailer does that, they will still ignore the blackmail regardless. Wise agents ignore order-0 blackmail, ignore order-1 blackmail in which the blackmailer decides to ignore their ignoring of order-0 blackmail, ignore order-omega blackmail in which the blackmailer decides to ignore all order-N refusals to be blackmailed, etcetera for all ordinals. If there is some ordinal of blackmail you do not ignore, you can be blackmailed, and how does that help?
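As a minimal sketch of that hierarchy (the payoff numbers, the function names, and the restriction to finite orders below are all illustrative assumptions, not anything from the thread): if the victim’s policy is to ignore blackmail at every order, then no order of blackmail is profitable, and a blackmailer that best-responds to that policy never blackmails at all.

```python
# Toy model of the order-N hierarchy (illustrative payoffs; only finite orders modeled).
# A victim policy maps the "order" of the blackmail to "ignore" or "give in".
# A blackmailer that best-responds to the victim's policy blackmails only if some
# order of blackmail is profitable.

THREAT_COST = 1                  # assumed cost of issuing / escalating the threat
PAYOUT_IF_VICTIM_GIVES_IN = 10   # assumed gain when the victim conforms

def wise_victim(order: int) -> str:
    """Ignore blackmail at every order, as in the parent comment."""
    return "ignore"

def caves_at(n: int):
    """A victim that stops ignoring at some finite order n."""
    return lambda order: "give in" if order >= n else "ignore"

def blackmailer_best_response(victim_policy, max_order: int = 100) -> str:
    """Blackmail at the lowest profitable order, else do not blackmail at all."""
    for order in range(max_order + 1):
        gain = PAYOUT_IF_VICTIM_GIVES_IN if victim_policy(order) == "give in" else 0
        if gain - THREAT_COST > 0:
            return f"blackmail at order {order}"
    return "do not blackmail"

print(blackmailer_best_response(wise_victim))  # -> do not blackmail
print(blackmailer_best_response(caves_at(7)))  # -> blackmail at order 7
```

The point of the “all ordinals” phrasing is that whatever order the victim finally caves at is exactly the order at which blackmailing them becomes worth it.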
Only if those blackmailers have wrongly anticipated that their victims will be stupid enough to conform.
Not blackmailing in response to that anticipation is a property of the behavior of the blackmailers that seems to have been used in deciding to ignore all blackmail. Suppose there were lots of “stupid” blackmailers around that blackmailed everyone all day, even if no victim ever conformed. Would it be a good idea to ignore all blackmail in that case? Is there a distinction between such blackmailers and particularly unfair laws of physics (say, sadistic Lords of the Matrix)? (It seems plausible that there is no fundamental distinction, and that sometimes the correct decision is to ignore these worlds and focus on other possibilities instead. But that seems to require knowing that there are valuable other possibilities that would be hurt by permitting the assumption that you are in one of the bad worlds; and if you have good evidence that you are in one of the bad worlds, then rejecting that possibility means you’d have to focus on very strange interpretations of that evidence that don’t imply you are in one of the bad worlds. This sort of rule seems to follow from deciding on a global strategy across possible worlds. It doesn’t provide decisions that help in the bad worlds, though; the decisions only have a good effect across worlds.)
(I still don’t have a good idea of what “blackmail” or “order-N considerations” means. The status quo (including the “default behavior”, “do nothing”, “not spending resources”) seems like exactly the kind of thing that can be determined by decisions. You are only “expending resources” if you eventually lose, since the time at which resources are spent and gained seems irrelevant; by that definition, whether something is an instance of blackmail depends on whether it’s successful. I suspect there is no simple rule for games: too many assumptions are potentially controllable by the opponent, and the only thing to do is to compare the consequences of alternative actions and act on that, which already potentially takes into account how the alternative actions would be taken into account by other agents, how the way in which they would be taken into account by some agents would influence the way in which the actions influence the decisions of other agents, etc. Some sense of “no blackmail” may be a correct expectation about smart agents, but it doesn’t necessarily suggest a good decision rule.)
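One way to make the “global strategy across possible worlds” point concrete is a toy expected-utility comparison; every probability and payoff below is a made-up assumption. The “ignore all blackmail” policy does worse in the worlds that contain unconditional blackmailers, but it can still win in expectation, because it is what removes the incentive for responsive blackmailers to appear in the other worlds.

```python
# Toy comparison of two victim policies across possible worlds
# (all probabilities and payoffs are made-up assumptions, not from the thread).
#
# World A: blackmailers are responsive -- they only threaten policies that give in.
# World B: "stupid"/unconditional blackmailers issue threats no matter what.

P_WORLD_B = 0.1   # assumed probability of facing an unconditional blackmailer

# Victim payoff by (policy, world): 0 if never harmed, -5 for conceding to a threat,
# -8 if a threat is carried out against you.
PAYOFFS = {
    "give in": {"A": -5, "B": -5},  # responsive blackmailers show up and you concede
    "ignore":  {"A":  0, "B": -8},  # no responsive blackmail, but world B still hurts
}

def expected_utility(policy: str) -> float:
    return (1 - P_WORLD_B) * PAYOFFS[policy]["A"] + P_WORLD_B * PAYOFFS[policy]["B"]

for policy in PAYOFFS:
    print(policy, expected_utility(policy))
# give in -5.0
# ignore -0.8
# "Ignore" wins in expectation even though it loses in world B: the decision only
# has a good effect across worlds, not in the bad world itself.
```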
Expecting a response to blackmail is why blackmailers would even exist in the first place.
Why would such stupid blackmailers exist any more than stupid anti-blackmailers (who, e.g., go around attacking anyone who would give in to blackmail if a blackmailer showed up), if not for a belief that somebody would give in to blackmail?
I think what Nesov is talking about is best described as a mind that will attack conditioned on victim behavior alone (not considering possible behavior changes of the victim in any way). This is different from an order-N blackmailer. In fact I think blackmail is the wrong word here (Nesov says that he does not know what blackmail means in this context, so this is not that surprising). For example, instead of seeking behavior modification through threats, such a mind seeks justice through retribution. I think the most likely SI that implements this is one extrapolating an evolved mind’s preferences. The will to seek justice through retribution leads to behavior changes in many cases, which leads to an evolutionary advantage. But once it has evolved, it’s a preference. If a guy committed a horrific crime (completely ignoring all sorts of law-enforcement threats), and it was then somehow ensured that he could never hurt anyone again, most people would still want justice (and other evolved minds might have made the same simplification: “if someone does that, I will hit them” is a relatively easily encoded and relatively effective strategy).
It is true that there might exist minds that will see the act of “giving in to retribution seekers” as itself deserving of retribution, and this could in principle cancel out all other retribution seekers. But it would seem like privileging the hypothesis to think that all such things cancel out completely. You might have absolutely no way of estimating which actions would make people seek retribution against you (I think the most complicating factor is that many consider “non-punishment of evildoers” to be worthy of retribution, and others consider “punishment of people that are not actually evildoers” worthy of retribution), but that is a fact about your map, not a fact about the territory (and unlike the blackmail thing, this is not an instance of ignorance to be celebrated). And the original topic was what an SI would do.
An SI would presumably be able to estimate this. In the case of an SI that is otherwise indifferent to humans, this cashes out to increased utility for “punish humans, to avoid retribution from those that think the non-punishment of humans is worthy of retribution” and increased utility for “treat humans nicely, to avoid retribution from those that would seek retribution for not treating them nicely” (those that require extermination are not really that important if that is the default behavior). If the resources it would take to punish or help humans are small, this would reduce the probability of extermination and increase the probability of punishment and help. The punishment would take a form that avoids retribution from those that categorically seek retribution for that type of punishment regardless of what the “crime” was. If there are lots of (evolvable, and likely to be extrapolated) minds that agree that a certain type of punishment (directed at our type of minds) constitutes “torture” and that torturers deserve to be punished (completely independently of how this affects their actions), then it will have to find some other form of punishment. So, basically: increased probability of very clever solutions that satisfy those demanding punishment while not pissing off those that categorically dislike certain types of punishment (so, some sort of convoluted and confusing existence that some (evolvable and retribution-inclined) minds consider “good enough punishment” and others consider “treated acceptably”). At least, increased probability of “staying alive a bit longer in some way that costs very little resources”.
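To make the kind of calculation being gestured at here slightly more concrete, a heavily hedged toy version follows; the action set, the hypothetical retribution-seeking constituencies, and every weight are invented for illustration only. The point is just that an option which partially satisfies several incompatible retribution criteria can come out ahead of both extermination and straightforward punishment.

```python
# Extremely speculative toy version of the "avoid retribution from many kinds of
# evolved minds" calculation.  Every category and number here is an invented
# assumption, purely for illustration.

# Assumed weights: expected retribution each hypothetical constituency could bring.
W_WANT_PUNISHMENT = 0.3   # minds that punish "non-punishment of evildoers"
W_DISLIKE_TORTURE = 0.5   # minds that punish certain kinds of punishment ("torture")
W_WANT_NICENESS = 0.2     # minds that punish not treating humans nicely

# Assumed retribution drawn by each candidate policy toward humans, per constituency.
RETRIBUTION = {
    "exterminate":           {"punish": 1.0, "torture": 0.2, "nice": 1.0},
    "harsh punishment":      {"punish": 0.0, "torture": 1.0, "nice": 0.8},
    "treat nicely":          {"punish": 0.8, "torture": 0.0, "nice": 0.0},
    "convoluted compromise": {"punish": 0.3, "torture": 0.1, "nice": 0.3},
}

def expected_retribution(policy: str) -> float:
    r = RETRIBUTION[policy]
    return (W_WANT_PUNISHMENT * r["punish"]
            + W_DISLIKE_TORTURE * r["torture"]
            + W_WANT_NICENESS * r["nice"])

for policy in RETRIBUTION:
    print(policy, round(expected_retribution(policy), 2))
print("least retribution:", min(RETRIBUTION, key=expected_retribution))
# Under these made-up weights the "convoluted compromise" draws the least retribution,
# mirroring the comment's "clever solutions that satisfy those demanding punishment
# while not angering those that dislike certain punishments".
```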
This would, for example, have policy implications for people that assume the many-worlds interpretation and do not care about measure. They can no longer launch a bunch of “semi-randomized AIs” (not random in the sense of “random neural network connections”, but more along the lines of “letting many teams create many designs, and then randomly selecting which one to run”) and hope that one will turn out OK and that the others will just kill everyone (since they can no longer be sure that an uncaring AI will kill them, they can no longer be sure that they will wake up in the universe of a caring AI).
(this seems related to what Will talks about sometimes, but using very different terminology)
Agreed that this is a different case, since it doesn’t originate in any expectation of behavior modification.
Since following through with a threat is (almost?) always costly to the blackmailer, victims do gain something by ignoring it: they force the blackmailer to put up or shut up, so to speak. On the other hand, victims do have something to lose by not ignoring blackmail: they allow their actions to be manipulated at little to no cost by the blackmailer.
That is, if you have a “never-give-in-to-blackmail-bot” then there is a “no-blackmail” equilibrium. The addition of blackmail does nothing but potentially impose costs on the blackmailer. If following through with the threat were a net gain for the blackmailer, then they should just do that regardless.
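A sketch of that claim with made-up numbers (none of them come from the thread): against a committed never-give-in bot, both blackmail branches are pure losses for the blackmailer, so not blackmailing is the best response.

```python
# Toy payoffs for the blackmail game (all numbers are illustrative assumptions).
# Issuing a threat costs the blackmailer a little, carrying it out costs more,
# and the only gain comes from a victim who gives in.

THREAT_COST = 1
FOLLOW_THROUGH_COST = 3
GAIN_FROM_CONCESSION = 10
VICTIM_CONCESSION_COST = 5
VICTIM_HARM_IF_EXECUTED = 8

def payoffs(blackmailer_threatens: bool, victim_gives_in: bool):
    """Return (blackmailer payoff, victim payoff)."""
    if not blackmailer_threatens:
        return (0, 0)
    if victim_gives_in:
        return (GAIN_FROM_CONCESSION - THREAT_COST, -VICTIM_CONCESSION_COST)
    # Threat issued, victim ignores it, blackmailer follows through anyway:
    return (-THREAT_COST - FOLLOW_THROUGH_COST, -VICTIM_HARM_IF_EXECUTED)

# Against a never-give-in bot, blackmailing is strictly worse for the blackmailer:
print(payoffs(blackmailer_threatens=True, victim_gives_in=False))   # (-4, -8)
print(payoffs(blackmailer_threatens=False, victim_gives_in=False))  # (0, 0)
```

Under these assumed numbers, the (always-blackmail, never-give-in) pairing discussed in the next comments gives (-4, -8), which is worse for everyone than (no blackmail, never-give-in) at (0, 0); that is the sense in which the all-blackmail outcome is a dead loss rather than an equilibrium a sensible blackmailer would pick.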
I was imagining that a potential blackmailer would self-modify into / just be an Always-Blackmail-bot specifically to make sure there would be no incentive for potential victims to be never-give-in-to-blackmail-bots.
But that leads to a stupid equilibrium of plenty of blackmailers and no participating victims. Everyone loses.
Yes, I agree that no blackmail seems to be the Right Equilibrium, but it’s not obvious to me exactly how to get there without the same reasoning that leads to becoming a never-give-in-bot also leading potential blackmailers to becoming always-blackmail-bots.
I find I am somewhat confused on this matter. Well, frankly, I suspect I’m just being stupid, that there’s some obvious extra step in the reasoning I’m being blind to. It “feels” that way, for lack of a better term.
My argument is more or less as follows (a toy simulation of it is sketched after the steps):
1. The act of agent A blackmailing agent B costs agent A more than not blackmailing agent B (at the very least, A could use the time spent saying “if you don’t do X then I will do Y” on something else).
2. If A is an always-blackmail-bot, then A will continue to incur the costs of futilely blackmailing B (given that B does not give in to blackmail).
3. If blackmailing B (and/or following through with the threat) were not a net cost to A, then A should blackmail B (and/or follow through with the threat) regardless of B’s position on blackmail. And by extension, agent B has no incentive to switch from his or her never-give-in strategy.
4. If A inspects B and determines that B will never give in to blackmail, then A will not waste resources blackmailing B.
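A small simulation of steps 1-4 (the agents, costs, and round count are hypothetical, chosen only to illustrate the argument): an always-blackmail-bot keeps eating the threat cost against a never-give-in victim, while a bot that inspects the victim’s disposition first never wastes anything.

```python
# Toy repeated interaction (the costs and round count are illustrative assumptions).

THREAT_COST = 1
PAYOUT_IF_VICTIM_GIVES_IN = 10
ROUNDS = 100

def never_give_in(threatened: bool) -> bool:
    """Victim B: never gives in, whether or not it is threatened."""
    return False

def always_blackmail_bot(victim) -> float:
    """Agent A, step 2: blackmails every round regardless, eating the cost each time."""
    total = 0.0
    for _ in range(ROUNDS):
        conceded = victim(threatened=True)
        total += (PAYOUT_IF_VICTIM_GIVES_IN if conceded else 0) - THREAT_COST
    return total

def inspecting_bot(victim) -> float:
    """Agent A, step 4: check the victim's disposition first, skip hopeless blackmail."""
    would_concede = victim(threatened=True)  # stand-in for inspecting B's disposition
    return always_blackmail_bot(victim) if would_concede else 0.0

print(always_blackmail_bot(never_give_in))  # -100.0: pure waste, per steps 1-3
print(inspecting_bot(never_give_in))        # 0.0: no resources wasted, per step 4
```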
Blackmail, almost definitionally, only happens in conditions of incomplete information.