But what a prior-and-utility system means by “credible” is that the expected disutility is large. If a blackmailer can, at finite cost to itself, put our AI in a situation with arbitrarily high expected disutility, then our AI is boned.
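A rough sketch of that worry in code, with every number (and the fixed prior floor) made up purely to show the shape of the problem: if the agent’s credence in a short verbal threat never falls below some floor while the claimed stakes are unbounded, the expected disutility of refusing can be pushed past any finite ransom at no extra cost to the blackmailer.

```python
# Illustrative sketch only: a made-up prior floor and made-up numbers.

PRIOR_FLOOR = 1e-20  # assumed: credence in "the mugger can really do this" never drops below this

def expected_loss_if_refused(claimed_loss):
    return PRIOR_FLOOR * claimed_loss

def pays_ransom(ransom, claimed_loss):
    # A bare prior-and-utility agent pays whenever the expected loss from
    # refusing exceeds the ransom.
    return expected_loss_if_refused(claimed_loss) > ransom

for claimed_loss in (1e6, 1e20, 1e30):
    print(claimed_loss, pays_ransom(ransom=100, claimed_loss=claimed_loss))
# 1e6 -> False, 1e20 -> False, 1e30 -> True: for any ransom there is a big enough claim.
```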
Ah, you’re worried about a blackmailer that can actually follow up on that threat. I would point out that humans usually pay ransoms, so the AI isn’t exactly making a different decision than we would make in the same situation.
Or, the AI might anticipate the problem and self-modify in advance to never submit to threats.
I’m worried about a blackmailer that can, with positive probability, follow up on that threat.
Yes, humans behave in the same way, at least according to economists. We pay ransoms when the probability of the threat being carried out, times the disutility that would result from the threat being carried out, is greater than the cost of the ransom. The difference is that for human-scale threats, this expected disutility does seem to be bounded.
The AI might anticipate the problem and self-modify to never submit to threats
That could mean one of at least two things: either the AI starts to work according to the rules of a (hitherto not conceived?) non-prior-and-utility system, or the AI calibrates its prior and its utility function so that it doesn’t submit to (some) threats. I think the question is whether something like the second idea can work.
No, see, that’s different.
If you’re dealing with a blackmailer that might be able to carry out their threats, then you investigate whether they can or not. The blackmailer themselves might assist you with this, since it’s in their interest to show that their threat is credible.
Allow me to demonstrate: Give $100 to the EFF or I’ll blow up the sun. Do you now assign a higher expected-value utility to giving $100 to the EFF, or to giving the same $100 instead to SIAI? If I blew up the moon as a warning shot, would that change your mind?
If you’re dealing with a blackmailer that might be able to carry out their threats, then you investigate whether they can or not. The blackmailer themselves might assist you with this, since it’s in their interest to show that their threat is credible.
The result of such an investigation might raise or lower P(threat can be carried out). This doesn’t change the shape of the question: can a blackmailer issue a threat with P(threat can be carried out) x U(threat is carried out) > H, for all H? Can it do so at a cost to itself that is bounded independently of H?
Allow me to demonstrate: Give $100 to the EFF or I’ll blow up the sun.
I refuse. According to economists, I have just revealed a preference:
P(Pavitra can blow up the sun) x U(Sun) < U($100)
If I blew up the moon as a warning shot, would that change your mind?
Yes. Now I have revealed
P(Pavitra can blow up the sun | Pavitra has blown up the moon) x U(Sun) > U($100)
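For concreteness, the same two revealed preferences with placeholder values (all of them assumptions, none taken from the exchange above):

```python
# Placeholder numbers only, to restate the two inequalities above.

U_SUN = 1e15        # assumed disutility of losing the sun
U_RANSOM = 100      # utility of keeping the $100

p_before = 1e-18    # P(Pavitra can blow up the sun)
p_after = 1e-3      # P(Pavitra can blow up the sun | Pavitra has blown up the moon)

print(p_before * U_SUN < U_RANSOM)   # True: refuse, matching the first inequality
print(p_after * U_SUN > U_RANSOM)    # True: pay, matching the second inequality
```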
My point is that U($100) is partially dependent on P(Mallory can blow up the sun) x U(Sun), for all values of Mallory and Sun such that Mallory is demanding $100 not to blow up the sun. If P(M_1 can heliocide) is large enough to matter, there’s a very good chance that P(M_2 can heliocide) is too. Credible threats do not occur in a vacuum.
I don’t understand your points, can you expand them?
In my inequalities, P and U denoted my subjective probabilities and utilities, in case that wasn’t clear.
The fact that probability and utility are subjective was perfectly clear to me.
I don’t know what else to say except to reiterate my original point, which I don’t feel you’re addressing:
Consider the proposition that, at some point in my life, someone will try to Pascal’s-mug me and actually back their threats up. In this case, I would still expect to receive a much larger number of false threats over the course of my lifetime. If I hand over all my money to the first mugger without proper verification, I won’t be able to pay up when the real threat comes around.
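A toy expected-value version of that argument; the number of demands, the budget, and the loss sizes are all assumptions chosen only for illustration.

```python
# Toy comparison: pay every demand without verification vs. verify first.
# All numbers are assumptions.

N_DEMANDS = 1000             # threats received over a lifetime
P_REAL = 0.001               # chance any given threat is real
RANSOM = 100
BUDGET = 5_000               # only enough to cover 50 ransoms
LOSS_IF_REAL_UNPAID = 10**9  # disutility of a real threat going unpaid

# Policy 1: pay everyone. The budget is gone after BUDGET // RANSOM demands,
# so real threats arriving after that point go unpaid.
paid = min(N_DEMANDS, BUDGET // RANSOM)
expected_real_after_broke = P_REAL * (N_DEMANDS - paid)
loss_pay_all = paid * RANSOM + expected_real_after_broke * LOSS_IF_REAL_UNPAID

# Policy 2: verify first and pay only the (expected) real threats.
loss_verify_first = P_REAL * N_DEMANDS * RANSOM

print(loss_pay_all, loss_verify_first)   # roughly 9.5e8 vs 100
```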
It’s not even clear to me that you disagree with me. I am proposing a formulation (not a solution!) of Pascal’s mugging problem: if a mugger can issue threats of arbitrarily high expected disutility, then a priors-and-utilities AI is boned. (A little more precisely: then the mugger can extract an arbitrarily large amount of utils from the P-and-U AI.) Are you saying that this statement is false, or just that it leaves out an essential aspect of Pascal’s mugging? Or something else?
I’m saying that this statement is false. The mugger needs also to somehow persuade the AI of the nonexistence of other muggers of similar credibility.
In the real world, muggers usually accomplish this by raising their own credibility beyond the “omg i can blow up the sun” level, such as by brandishing a weapon.
OK let me be a little more careful. The expected disutility the AI associates to a threat is
EU(threat) = P(threat will be carried out) x U(threat will be carried out) + P(threat will not be carried out) x U(threat will not be carried out)
I think that the existence of other muggers with bigger weapons, or just of other dangers and opportunities generally, is accounted for in the second summand.
Now does the formulation look OK to you?
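Read directly as code, with every input left to the agent to supply, the formulation is just:

```python
def eu_threat(p_carried_out, u_if_carried_out, u_if_not_carried_out):
    # Other muggers, and background dangers generally, live inside
    # u_if_not_carried_out (the second summand), not as a separate term.
    return (p_carried_out * u_if_carried_out
            + (1 - p_carried_out) * u_if_not_carried_out)
```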
That formulation seems to fail to distinguish (ransom paid)&(threat not carried out) from (ransom not paid)&(threat not carried out).
There are two courses of action being considered: pay ransom or don’t pay ransom.
EU(pay ransom) = P(no later real threat) * U(sun safe) + P(later real threat) * U(sun explodes)
EU(don’t pay ransom) = P(threat fake) * ( P(no later real threat) + P(later real threat) * P(later real threat correctly identified as real | later real threat) ) * U(sun safe) + ( P(threat real) + P(threat fake) * P(later real threat) * P(later real threat incorrectly identified as fake | later real threat) ) * U(sun explodes)
That’s completely unreadable. I need symbolic abbreviations.
R=EU(pay ransom); r=EU(don’t pay ransom)
S=U(sun safe); s=U(sun explodes)
T=P(threat real); t=P(threat fake)
L=P(later real threat); M=P(no later real threat)
i=P(later real threat correctly identified as real | later real threat)
j=P(later real threat incorrectly identified as fake | later real threat)
Then:
R = M*S + L*s
r = t*(M + L*i)*S + (T + t*L*j)*s
(p.s.: We really need a preview feature.)
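Plugging placeholder values into the abbreviated branches (every number below is an assumption, not something taken from the exchange):

```python
# Placeholder values only.
S, s = 0.0, -1e15         # U(sun safe), U(sun explodes)
T, t = 1e-12, 1 - 1e-12   # P(threat real), P(threat fake)
L, M = 1e-6, 1 - 1e-6     # P(later real threat), P(no later real threat)
i, j = 0.9, 0.1           # P(correctly identified as real | later real), P(missed | later real)

R = M * S + L * s                              # EU(pay ransom)
r = t * (M + L * i) * S + (T + t * L * j) * s  # EU(don't pay ransom)

print(R, r, "pay" if R > r else "don't pay")   # with these numbers, don't pay wins
```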
Why so much focus on future threats to the sun? Are you going to argue, by analogy with the prisoner’s dilemma, that the iterated Pascal’s mugging is easier to solve than the one-shot Pascal’s mugging?
That formulation seems to fail to distinguish (ransom paid)&(threat not carried out) from (ransom not paid)&(threat not carried out).
I thought that, either by definition or as a simplifying assumption, EU(ransom paid & threat not carried out) = current utility minus the size of the ransom, and that EU(ransom not paid & threat not carried out) = current utility.
My primary thesis is that the iterated Pascal’s mugging is much more likely to approximate any given real-world situation than the one-shot Pascal’s mugging, and that focusing on the latter is likely to lead by availability heuristic bias to people making bad decisions on important issues.
My primary thesis is that if you have programmed a purported god-like and friendly AI that you know will do poorly in one-shot Pascal’s mugging, then you should not turn it on. Even if you know it will do well in other variations on Pascal’s mugging.
My secondary thesis comes from Polya: “If there’s a problem that you can’t solve, then there’s a simpler problem that you can solve. Find it!” Solutions to, failed solutions to, and ideas about one-shot Pascal’s mugging will illuminate features of iterated Pascal’s mugging, and of many real-world situations as well.
(“One-shot”, “iterated”... if these are even good names!)
I’m not persuaded that paying the ransom is doing poorly on the one-shot. And if it predictably does the wrong thing, in what sense is it Friendly?
Forget it. I’m just weirded out that you would respond to “here’s a tentative formalization of a simple version of Pascal’s mugging” with “even thinking about it is dangerous.” I don’t agree and I don’t understand the mindset.
I don’t mean to say that thinking about the one-shot is dangerous, only that grossly overemphasizing it relative to the iterated might be.
I hear about the one-shot all the time, and the iterated not at all, and I think the iterated is more likely to come up than the one-shot, and I think the iterated is easier to solve than the one-shot, so all in all I think it’s completely reasonable for me to want to emphasize the iterated.
Granted! And tell me more.
The iterated has an intuitively easy-to-accept solution: don’t just accept blackmail from anyone who offers it, but rather investigate first to see whether they constitute a credible threat.
The one-shot Pascal’s Mugging, like most one-shot games discussed in game theory, has a harder-to-stomach dominant strategy: pay the ransom, because the mere claim, considered as Bayesian evidence, raises the probability of the threat well above the reciprocal of its utility-magnitude.
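A toy version of that last comparison, again with assumed numbers: if the bare demand, taken as Bayesian evidence, leaves the probability of the threat well above the reciprocal of the threatened disutility, the expected-utility comparison favors paying.

```python
# Assumed numbers for the one-shot comparison described above.
U_THREAT = 1e15        # magnitude of the threatened disutility
RANSOM = 100
p_after_claim = 1e-9   # assumed credence left by the bare claim, taken as evidence

print(p_after_claim > 1 / U_THREAT)        # True: well above the reciprocal of the stakes
print(p_after_claim * U_THREAT > RANSOM)   # True: expected loss from refusing exceeds the ransom
```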