Can’t an AI escape the dangers of Pascal’s Mugging by having a decision theory that weighs against having exploitable decision theories according to the measure of their exploitability?
The dangers pointed to by the thought experiment aren’t restricted to exploitation by an outside entity. An AI should be able to safely consider the hypothesis “If I don’t destroy my future light cone, 3^^^3 people outside the universe will be killed” regardless of where the hypothesis came from.
But even if we’re just worried about mugging, how could you possibly weight it enough? Even if paying once doomed me to spend the rest of my life paying $5 to muggers, the utility calculation still works out the same way.
My idea is as follows:
Mugger: Give me 5 dollars, or I’ll torture 3^^^3 sentient people across the omniverse using my undetectable magical powers.

AI: If I make my decision on this and similar trades based on a decision process DP0 of comparing disutility(3^^^3 tortures) multiplied by P(you’re telling the truth) against disutility(giving you 5 dollars), then even if you’re telling the truth, a different malicious agent may then merely name a threat that involves 3^^^^3 tortures, and thus make me cause a vastly greater amount of disutility in his service. Indeed, there’s no upper bound on the disutility such a hypothetical agent may claim he will cause, and therefore surrendering to such demands means a likewise unbounded exploitation potential. Therefore I will not use the decision process DP0, and will instead utilize some different decision process (like “Never surrender to blackmail” or “Always demand proportional evidence before considering sufficiently extraordinary claims”).
Saving 3^^^^3 people is more than worth a bit of vulnerability to blackmail. If 3^^^^3 people are in danger, the AI wishes to believe 3^^^^3 people are in danger and in that case “never surrender to blackmail” is a strictly worse strategy.
Also, DP0 isn’t even a coherent decision process. The expected utilities will fail to converge if “there’s no upper bound to the disutility such a hypothetical agent may claim” and these claims are interpreted with some standard assumptions, so the agent has no way of even comparing the expected utilities of actions.
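The divergence can be sketched numerically. In the toy model below, a threat that takes n extra symbols to state gets prior probability 2^-n, while the disutility it names grows like 2^(2^n), a stand-in that grows far slower than up-arrow towers yet already swamps the prior. The model and its numbers are illustrative assumptions, not anything stated in the thread.

```python
# Toy illustration of why DP0's expected utilities fail to converge.
# Hypothetical assumption: a threat costing n extra symbols to state has
# prior probability 2**-n, while the disutility it names can grow like
# 2**(2**n). Working in log2 keeps the numbers representable.

def log2_term(n: int) -> int:
    """log2 of p_n * u_n, the n-th expected-disutility term."""
    log2_prior = -n            # prior 2**-n: shrinks linearly in log-space
    log2_disutility = 2 ** n   # disutility 2**(2**n): grows exponentially
    return log2_prior + log2_disutility

# The terms grow without bound, so the expected-disutility series diverges:
print([log2_term(n) for n in range(1, 8)])
# -> [1, 2, 5, 12, 27, 58, 121]
```

Since the individual terms blow up instead of shrinking, there is no finite expected utility for the agent to compare against the $5.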
If 3^^^^3 people are in danger, the AI wishes to believe 3^^^^3 people are in danger
This isn’t about beliefs, this is about decisions. The process of epistemic rationality needn’t be modified, only the process of instrumental rationality. Regardless of how much probability the AI assigns to the danger to 3^^^^3 people, it needn’t be the right choice to decide based on a mere probability of such danger multiplied by the disutility of the harm done.
Saving 3^^^^3 people is more than worth a bit of vulnerability to blackmail. If 3^^^^3 people are in danger, the AI wishes to believe 3^^^^3 people are in danger and in that case “never surrender to blackmail” is a strictly worse strategy.
Unless having the decision process that surrenders to blackmail and being known to have it is what will put these people in danger in the first place. In that case, either you modify your decision process so that you precommit to not surrender to blackmail and prove it to other people in advance, or pretend to not surrender and submit to individual blackmails if enough secrecy of such submission can be ensured so that future agents won’t be likely to be encouraged to blackmail.
But this was just an example of an alternate decision theory, e.g. one that had hardwired exceptions against blackmail. I’m not actually saying it need be anything as absolute or simple as that—if it were as simple as that, I’d have solved the Pascal’s Mugger problem by saying “TDT plus don’t submit to blackmail” instead of saying “weigh against your decision process by a factor proportional to its exploitability potential”.
We seem to be thinking of slightly different problems. I wasn’t thinking of the mugger’s decision to blackmail you as dependent on their estimate that you will give in. There are possible muggers who will blackmail you regardless of your decision theory and refusing to submit to blackmail would cause them to produce large negative utilities.
And as I said my example about a blanket refusal to submit to blackmail was just an example. My more general point is to evaluate the expected utility of your decision theory itself, not just the individual decision.
In the situation I presented, the decision theory had no effect on the utility other than through its effect on the choice. In that case, the expected utility of the decision theory and the expected utility of the choice reduce to the same thing, so your proposal doesn’t seem to help. Do you agree with that, or am I misapplying the idea somehow?
I’m not sure that they reduce to the same thing. In e.g. Newcomb’s problem, if you reduce your two options to “P(full box A) × U(full box A)” versus “P(full box A) × U(full box A) + U(full box B)”, where U(x) is the utility of x, then you end up two-boxing; that’s causal decision theory.
It’s only when you consider the utility of different decision theories, that you end up one boxing, because then you’re effectively considering U(any decision theory in which I one-box) vs U(any decision theory in which I two-box) and you see that the expected utility of one-boxing decision theories is greater.
In Pascal’s mugging… again I don’t have the math to do this (or it would have been a discussion post, not an open-thread comment), but my intuition tells me that a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is, and therefore of total negative utility. The mugger can add up-arrows until he has concentrated enough disutility in his threat to ask the AI to submit to his every whim and conquer the world on the mugger’s behalf, etc...
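The Newcomb comparison above can be put in toy numbers. The predictor accuracy p and the dollar amounts below are hypothetical stand-ins, not figures from the thread.

```python
# Policy-level evaluation of Newcomb's problem. Assumptions (hypothetical):
# box B always holds $1,000; box A holds $1,000,000 iff the predictor,
# with accuracy p, predicted one-boxing.

A, B = 1_000_000, 1_000

def eu_of_policy(one_box: bool, p: float) -> float:
    """Expected utility of *being* an agent with this policy, so the
    prediction correlates with the choice (the policy-level evaluation)."""
    if one_box:
        return p * A               # box A is full with probability p
    return (1 - p) * A + B         # box A is full only on a misprediction

print(eu_of_policy(True, 0.75), eu_of_policy(False, 0.75))
# -> 750000.0 251000.0
```

Under this evaluation, one-boxing policies win for any accuracy p above (A + B) / 2A, roughly 0.5005, matching the claim that the expected utility of one-boxing decision theories is greater.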
If the adversary does not take into account your decision theory in any way before choosing to blackmail you, U(any decision theory where I pay if I am blackmailed) = U(pay) and U(any decision theory where I refuse to pay if I am blackmailed) = U(refuse), since I will certainly be blackmailed no matter what my decision theory is, so what situation I am in has absolutely no counterfactual dependence on my action.
a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is
The truth of this statement is very hard to analyze, since it is effectively a statement about the entire space of possible decision theories. Right now, I am not aware of any decision theory that can be made to overwrite itself completely just by promising it more utility or threatening it with less. Perhaps you can sketch one for me, but I can’t figure out how to make one without using an unbounded utility function, which wouldn’t give a coherent decision agent using current techniques as per the paper that I linked a few comments up.
Anyway, I don’t really have a counter-intuition about what is going wrong with agents that give in to Pascal’s mugging. Everything gets incoherent very quickly, but I am utterly confused about what should be done instead.
That said, if an agent would take the mugger’s threat seriously under a naive decision theory and that disutility is more than the disutility of being exploitable by arbitrary muggers, decision-theoretic concerns do not make the latter disutility greater in any way. The point of UDT-like reasoning is that “what counterfactually would have happened if you decided differently” means more than just the naive causal interpretation would indicate. If you precommit to not pay a mugger, the mugger (who is familiar with your decision process) won’t go to the effort of mugging you for no gain. If you precommit not to find shelter in a blizzard, the blizzard still kills you.
So the AI is not an expected utility maximizer?

If it is not, then what is it? If it is, then what calculations did it use to reach the above decision—what were the assigned probabilities to the scenarios mentioned?
It’s an expected utility maximizer, but it considers the expected utility of its decision process, not just the expected utility of individual decisions. In a world where there exist more known liars than known superhuman entities, and any liar can claim superhuman powers, any decision process that allows them to exploit you is of negative expected utility.
It’s like the professor in the example who agrees to accept a delayed essay when the delay was caused by a grandmother’s death, because this is a valid reason that will largely not be exploited, but not “I wanted to watch my favorite team play”, because lots of other students would be able to use the same excuse. The professor’s not just considering the individual decision, but whether the decision process would be of negative utility in a more general manner.
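A minimal sketch of this policy-level evaluation, with made-up numbers for the base rate of genuine threats and the cost of paying. Note that the comparison only resolves cleanly while the harm stays finite.

```python
# Toy version of the professor/mugger point: evaluate a fixed *policy*
# over many encounters rather than a single decision. All parameters
# below are hypothetical; exact binary fractions are used so the
# arithmetic comes out exact.

def policy_utility(pay: bool, encounters: int, p_genuine: float,
                   cost: float, harm: float) -> float:
    """Expected utility of following one policy across all encounters."""
    if pay:
        return -encounters * cost              # pay everyone, avert all harm
    return -encounters * p_genuine * harm      # refuse; rare real threats land

print(policy_utility(True, 128, 2**-20, 5.0, 2**10))   # -> -640.0
print(policy_utility(False, 128, 2**-20, 5.0, 2**10))  # -> -0.125
```

With a finite harm, refusing wins whenever p_genuine × harm < cost, matching the "world with more known liars than superhuman entities". If harm is allowed to be 3^^^3-sized, the product swamps everything again, which is precisely what the other commenter objects to above.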
It seems to me that you run into the mathematical problem again when trying to calculate the expected utility of its decision process. Some of the outcomes of the decision process are associated with utilities of 3^^^3.
Perhaps. I don’t have the math to see how the whole calculation would go.
But it seems to me that the utility of 3^^^3 is associated with a particular execution instance. However, when evaluating the decision process as a whole (not the individual decision), the 3^^^3 utility mentioned by the mugger doesn’t have a privileged position over the hypothetical malicious/lying individuals who can even more easily talk about utilities or disutilities of 3^^^^3 or 3^^^^^3, or even have their signs reversed (so that they torture people if you submit to their demands, despite their claims to the opposite).
So the result should ideally be a different decision process that is able to reject unsubstantiated claims by potentially-lying individuals completely, instead of just trying to fudge the “Probability” of the truth-value of the claim, or the calculated utility if the claim is true.
Give me $5 or I will torture 3^^^^3 sentient people across the omniverse for 1,000 years each and then kill them, using my undetectable magical powers. You can pay me by PayPal to mwengler@gmail.com. Unless 20 people respond (or the integrated total I receive reaches $100), I will carry out the torture.
Now you may think I am making the above statement to make a point. Indeed it seems probable, but what if I am not? How do you weigh the very finite probability that I mean it against 3^^^^3 sentient lives?
I feel confident that the amount of money I receive by PayPal will be a more meaningful statement about what people really think of (infinitesimal probability) * (nearly infinite evil) = well over $5 worth of utilons.
Do others agree? Or do they think these comments, which cost nothing but another 15 minutes away from reading a different post, are what really mean something?
The issue is how to program a decision theory (or meta-decision theory, perhaps) that doesn’t fall victim to Pascal’s mugging and similar scenarios, not to show that humans mostly don’t fall victim to it.
However, it’s probably worth figuring out what processes people use which cause them to not be very vulnerable to Pascal’s Mugging.
Or is it just that people aren’t vulnerable to Pascal’s Mugging unless they’re mentally set up for it? People will sometimes give up large amounts of personal value to prevent small or dubious amounts of damage if their religion or government tells them to.
I think there is not enough discussion of the quality of information. Conscious beings tell you things to increase their own utility functions, not to inform you. Magicians trick you on purpose, and (most of us) realize that, and they are not even above human intelligence. Scammers scam us. Well-meaning idiots sell us vitamins and minerals, and my sister just asked me about spending a few thousand dollars on a red-light laser to increase her well-being!
As for the whole one-box vs. two-box thing: if someone claiming to be a brilliant alien had pulled this off 100 times and was now checking in with me, I would find it much more believable that they were a talented scam artist than that they could perform predictive calculations requiring a ^ to express, relative to any calculations we currently know can be done.
Real intelligences don’t believe anywhere near everything they hear. And they STILL are gullible.
I agree with your first paragraph, but I’m not convinced of your second paragraph… at least, if you intend it as a rhetorical way of asserting that there is no possible way to weight the evidence properly. It’s just another proposition; there’s evidence for and against it.
I think we get confused here because we start with our bottom line already written.
I “know” that the EV of destroying my light cone is negative. But theory seems to indicate that, when assigning a probability P1 to the statement “Destroying my future light cone will preserve 3^^^3 extra-universal people” (hereafter, statement S1), a well-calibrated inference engine might assign P1 such that the EV of destroying my light cone is positive. So I become anxious, and I try to alter the theory so that the resulting P1s are aligned with my pre-existing “knowledge” that the EV of destroying my light cone is negative.
Ultimately, I have to ask what I trust more: the “knowledge” produced by the poorly calibrated inference engine that is my brain, or the “knowledge” produced by the well-calibrated inference engine I built? If I trust the inference engine, then I should trust the inference engine.
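The arithmetic driving that anxiety can be sketched. All the magnitudes below are placeholders, since 3^^^3 itself dwarfs anything a float can represent.

```python
# For any fixed cost C of destroying the light cone and any nonzero
# probability P1 of statement S1, a large enough claimed payoff N flips
# the expected value positive. The specific numbers are hypothetical.

def ev_destroy(P1: float, N: float, C: float) -> float:
    """Expected value of destroying the light cone under statement S1:
    P1 * N lives preserved in expectation, minus C lives lost for sure."""
    return P1 * N - C

print(ev_destroy(1e-20, 1e10, 1e10))  # modest payoff: decisively negative
print(ev_destroy(1e-20, 1e40, 1e10))  # huge payoff: positive despite tiny P1
```

No matter how small the well-calibrated P1 comes out, there is some N beyond which the expected value turns positive, which is why adjusting the theory to force P1 down cannot by itself escape the problem.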