If the adversary does not take your decision theory into account in any way before choosing to blackmail you, then U(any decision theory where I pay if blackmailed) = U(pay) and U(any decision theory where I refuse to pay if blackmailed) = U(refuse): I will certainly be blackmailed no matter what my decision theory is, so the situation I end up in has no counterfactual dependence whatsoever on my decision.
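A toy expected-utility sketch of that point, with made-up numbers and names purely for illustration: when the probability of being blackmailed does not depend on my policy, the expected utility of a policy collapses to the utility of whatever action it prescribes once blackmailed.

```python
# Toy sketch (hypothetical numbers): the adversary blackmails me regardless of my policy,
# so expected utility depends only on what I do once blackmailed.
P_BLACKMAIL = 1.0   # probability of being blackmailed; does not depend on my decision theory
U_PAY = -10         # utility if I pay
U_REFUSE = -100     # utility if I refuse and the threat is carried out

def expected_utility(policy):
    """policy is what I would do *if* blackmailed; the blackmail itself doesn't depend on it."""
    u_if_blackmailed = U_PAY if policy == "pay" else U_REFUSE
    return P_BLACKMAIL * u_if_blackmailed

print(expected_utility("pay"))     # -10.0,  i.e. U(pay)
print(expected_utility("refuse"))  # -100.0, i.e. U(refuse)
```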
"a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is"
The truth of this statement is very hard to analyze, since it is effectively a claim about the entire space of possible decision theories. Right now, I am not aware of any decision theory that can be made to overwrite itself completely just by promising it more utility or threatening it with less. Perhaps you can sketch one for me, but I can’t figure out how to construct one without using an unbounded utility function, which wouldn’t yield a coherent decision agent under current techniques, as per the paper I linked a few comments up.
Anyway, I don’t really have a counter-intuition about what is going wrong with agents that give in to Pascal’s mugging. Everything gets incoherent very quickly, but I am utterly confused about what should be done instead.
That said, if an agent would take the mugger’s threat seriously under a naive decision theory, and that disutility is greater than the disutility of being exploitable by arbitrary muggers, decision-theoretic concerns do not make the latter disutility any greater. The point of UDT-like reasoning is that “what counterfactually would have happened if you had decided differently” means more than the naive causal interpretation would indicate. If you precommit not to pay a mugger, the mugger (who is familiar with your decision process) won’t go to the effort of mugging you for no gain. If you precommit not to find shelter in a blizzard, the blizzard still kills you.
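Here is a small sketch of that contrast, again with hypothetical numbers and invented function names: against an adversary who models my decision process, the probability of being threatened depends on my policy; against a blizzard, it does not.

```python
# Toy contrast (hypothetical numbers): a mugger who models my decision process
# versus a blizzard that behaves the same no matter what I precommit to.
COST_OF_GIVING_IN = -10          # paying the mugger / trudging to shelter
COST_OF_THREAT_REALIZED = -100   # threat carried out / dying in the blizzard

def eu_vs_mugger(policy):
    if policy == "pay":
        # The mugger, familiar with my decision process, knows mugging me works.
        return COST_OF_GIVING_IN
    # There is no gain from mugging a refuser, so the mugging never happens.
    return 0.0

def eu_vs_blizzard(policy):
    # The blizzard does not consult my policy before happening.
    return COST_OF_GIVING_IN if policy == "pay" else COST_OF_THREAT_REALIZED

for policy in ("pay", "refuse"):
    print(policy, eu_vs_mugger(policy), eu_vs_blizzard(policy))
# Against the mugger, refusing dominates: the mugging never happens.
# Against the blizzard, "paying" (seeking shelter) dominates: refusal just means
# the threat is realized anyway.
```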