If 3^^^^3 people are in danger, the AI wishes to believe 3^^^^3 people are in danger
This isn’t about beliefs; it’s about decisions. The process of epistemic rationality needn’t be modified, only the process of instrumental rationality. Regardless of how much probability the AI assigns to the danger to 3^^^^3 people, it needn’t be the right choice to decide based on the mere probability of such danger multiplied by the disutility of the harm done.
Saving 3^^^^3 people is more than worth a bit of vulnerability to blackmail. If 3^^^^3 people are in danger, the AI wishes to believe 3^^^^3 people are in danger and in that case “never surrender to blackmail” is a strictly worse strategy.
Unless having a decision process that surrenders to blackmail, and being known to have it, is what puts these people in danger in the first place. In that case, either you modify your decision process so that you precommit to not surrendering to blackmail and prove it to other people in advance, or you pretend not to surrender and submit to individual blackmails only when the submission can be kept secret enough that future agents are unlikely to be encouraged to blackmail you.
But this was just an example of an alternate decision theory, namely one that had hardwired exceptions against blackmail. I’m not actually saying it need be anything as absolute or simple as that. If it were as simple as that, I’d have solved the Pascal’s Mugger problem by saying “TDT plus don’t submit to blackmail” instead of saying “weigh against your decision process by a factor proportional to its exploitability potential”.
We seem to be thinking of slightly different problems. I wasn’t thinking of the mugger’s decision to blackmail you as dependent on their estimate that you will give in. There are possible muggers who will blackmail you regardless of your decision theory, and refusing to submit to them would lead them to produce large negative utilities.
And as I said, my example about a blanket refusal to submit to blackmail was just an example. My more general point is to evaluate the expected utility of your decision theory itself, not just of the individual decision.
In the situation I presented, the decision theory had no effect on the utility other than through its effect on the choice. In that case, the expected utility of the decision theory and the expected utility of the choice reduce to the same thing, so your proposal doesn’t seem to help. Do you agree with that, or am I misapplying the idea somehow?
I’m not sure that they reduce to the same thing. In Newcomb’s problem, for example, if you reduce your two options to “P(full box A) U(full box A)” versus “P(full box A) U(full box A) + U(full box B)”, where U(x) is the utility of x, then you end up two-boxing; that’s causal decision theory.
It’s only when you consider the utility of different decision theories that you end up one-boxing, because then you’re effectively comparing U(any decision theory in which I one-box) with U(any decision theory in which I two-box), and you see that the expected utility of one-boxing decision theories is greater.
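To make the contrast concrete, here is a rough sketch of the two calculations. The payoffs (1,000,000 for a full box A, 1,000 for box B) and the 99% predictor accuracy are assumptions picked for illustration, and the function names are hypothetical; nothing here is meant as a formal statement of either decision theory.

```python
U_A = 1_000_000  # assumed payoff of a full box A
U_B = 1_000      # assumed payoff of box B

def cdt_value(action, p_full_a):
    """Expected utility when P(full box A) is held fixed whatever I choose."""
    base = p_full_a * U_A
    return base + U_B if action == "two-box" else base

def policy_value(policy, accuracy):
    """Expected utility when the predictor's guess tracks my policy."""
    if policy == "one-box":
        # Box A is full only if the predictor guessed correctly.
        return accuracy * U_A
    # Box A is full only if the predictor guessed wrongly.
    return (1 - accuracy) * U_A + U_B

# With P(full box A) fixed, two-boxing wins for every value of that probability.
for p in (0.0, 0.5, 1.0):
    assert cdt_value("two-box", p) > cdt_value("one-box", p)

# Comparing policies instead, one-boxing comes out ahead at 99% accuracy.
print(policy_value("one-box", 0.99), policy_value("two-box", 0.99))
```

The only difference between the two functions is whether the probability of a full box A is allowed to depend on what I do, which is the whole disagreement in miniature.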
In Pascal’s mugging, again, I don’t have the math to do this (or it would have been a discussion post, not an open-thread comment), but my intuition tells me that a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is, and is therefore of total negative utility. The mugger can add up-arrows until he has concentrated enough disutility in his threat to ask the AI to submit to his every whim, conquer the world on the mugger’s behalf, and so on.
If the adversary does not take into account your decision theory in any way before choosing to blackmail you, then U(any decision theory where I pay if I am blackmailed) = U(pay) and U(any decision theory where I refuse to pay if I am blackmailed) = U(refuse). I will certainly be blackmailed no matter what my decision theory is, so the situation I am in has absolutely no counterfactual dependence on my action.
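Spelled out as a sketch, with notation that is mine rather than anything standard in this thread: write π for a decision theory, π(blackmail) for the action it takes when blackmailed, and assume the only way π affects the outcome is through that action.

```latex
\begin{align*}
\mathbb{E}[U \mid \pi]
  &= P(\text{blackmail} \mid \pi)\, U(\pi(\text{blackmail}))
   + \bigl(1 - P(\text{blackmail} \mid \pi)\bigr)\, U(\text{no blackmail}) \\
  &= U(\pi(\text{blackmail}))
  \qquad \text{when } P(\text{blackmail} \mid \pi) = 1 \text{ for every } \pi .
\end{align*}
```

The second line is the case I am describing: once the probability of being blackmailed stops depending on π, ranking decision theories is the same as ranking the actions pay and refuse.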
a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is
The truth of this statement is very hard to analyze, since it is effectively a statement about the entire space of possible decision theories. Right now, I am not aware of any decision theory that can be made to overwrite itself completely just by promising it more utility or threatening it with less. Perhaps you can sketch one for me, but I can’t figure out how to make one without using an unbounded utility function, which wouldn’t give a coherent decision agent using current techniques as per the paper that I linked a few comments up.
Anyway, I don’t really have a counter-intuition about what is going wrong with agents that give in to Pascal’s mugging. Everything gets incoherent very quickly, but I am utterly confused about what should be done instead.
That said, if an agent would take the mugger’s threat seriously under a naive decision theory, and that disutility is more than the disutility of being exploitable by arbitrary muggers, decision-theoretic concerns do not make the latter disutility greater in any way. The point of UDT-like reasoning is that “what counterfactually would have happened if you decided differently” means more than just the naive causal interpretation would indicate. If you precommit to not pay a mugger, the mugger (who is familiar with your decision process) won’t go to the effort of mugging you for no gain. If you precommit not to find shelter in a blizzard, the blizzard still kills you.
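A toy version of the mugger-versus-blizzard contrast, in case it helps. All the utilities are made-up numbers and the two functions are hypothetical models: the mugger is assumed to read my policy perfectly, and the blizzard is assumed to ignore it entirely.

```python
# Illustrative utilities only; none of these numbers come from the thread.
COST_OF_PAYING = -5       # handing money to the mugger
COST_OF_SHELTER = -1      # effort of finding shelter
COST_OF_FREEZING = -1000  # staying outside in the blizzard

def predictive_mugger(policy):
    """A mugger who knows my policy and only mugs people who would pay."""
    if policy == "refuse":
        return 0              # never mugged, so nothing is lost
    return COST_OF_PAYING     # mugged, and pays

def blizzard(policy):
    """The blizzard arrives no matter what my policy is."""
    if policy == "shelter":
        return COST_OF_SHELTER
    return COST_OF_FREEZING

# Precommitting to refuse changes what the mugger does, so it helps.
assert predictive_mugger("refuse") > predictive_mugger("pay")

# Precommitting against shelter changes nothing about the blizzard, so it only hurts.
assert blizzard("shelter") > blizzard("no-shelter")
```

A mugger who blackmails regardless of my decision theory belongs in the same category as `blizzard` here: the precommitment has nothing to act on.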