In the situation I presented, the decision theory had no effect on the utility other than through its effect on the choice. In that case, the expected utility of the decision theory and the expected utility of the choice reduce to the same thing, so your proposal doesn’t seem to help. Do you agree with that, or am I misapplying the idea somehow?
I’m not sure that they reduce to the same thing. In Newcomb’s problem, for example, if you reduce your two options to “P(full box A) U(full box A)” versus “P(full box A) U(full box A) + U(full box B)”, where U(x) is the utility of x, then you end up two-boxing; that’s causal decision theory.
It’s only when you consider the utility of different decision theories that you end up one-boxing, because then you’re effectively considering U(any decision theory in which I one-box) vs U(any decision theory in which I two-box), and you see that the expected utility of one-boxing decision theories is greater.
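To make the two comparisons concrete, here is a rough sketch in Python. The payoffs (1,000,000 in box A, 1,000 in box B) and the predictor accuracy are assumptions I picked for illustration, and the function names are hypothetical; nothing here is meant as a formal statement of either decision theory.

```python
# Assumed payoffs, chosen only for illustration.
U_FULL_A = 1_000_000  # utility of a full box A (the predictor-filled box)
U_FULL_B = 1_000      # utility of box B (always full)


def act_level_eu(p_full_a):
    """Compare the two acts while holding P(full box A) fixed.

    This mirrors the causal-decision-theory framing: the action is
    treated as having no influence on whether box A is full.
    Returns (EU of one-boxing, EU of two-boxing).
    """
    one_box = p_full_a * U_FULL_A
    two_box = p_full_a * U_FULL_A + U_FULL_B
    return one_box, two_box


def policy_level_eu(accuracy):
    """Compare decision theories, letting P(full box A) depend on the policy.

    `accuracy` is the assumed probability that the predictor correctly
    guesses which kind of decision theory the agent runs.
    Returns (EU of one-boxing policies, EU of two-boxing policies).
    """
    one_box = accuracy * U_FULL_A
    two_box = (1 - accuracy) * U_FULL_A + U_FULL_B
    return one_box, two_box
```

With P(full box A) held fixed, two-boxing comes out ahead by U(full box B) whatever the probability is. Once the probability is allowed to depend on the policy, the one-boxing policies win for any reasonably accurate predictor.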
In Pascal’s mugging… again I don’t have the math to do this (or it would have been a discussion post, not an open-thread comment), but my intuition tells me that a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is, and therefore of total negative utility. The mugger can add up-arrows until he has concentrated enough disutility in his threat to ask the AI to submit to his every whim and conquer the world on the mugger’s behalf, etc...
If the adversary does not take into account your decision theory in any way before choosing to blackmail you, U(any decision theory where I pay if I am blackmailed) = U(pay) and U(any decision theory where I refuse to pay if I am blackmailed) = U(refuse), since I will certainly be blackmailed no matter what my decision theory is, so what situation I am in has absolutely no counterfactual dependence on my action.
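Spelled out a little more, the claim is the following. The notation is my own assumption, not anything from the earlier comments: π is a decision theory viewed as a policy, s ranges over situations, and π(s) is the action the policy takes in s.

```latex
\begin{align*}
EU(\pi) &= \sum_{s} P(s \mid \pi)\, U(\pi(s), s) \\
        &= \sum_{s} P(s)\, U(\pi(s), s)
           && \text{(the adversary ignores } \pi \text{)} \\
        &= U(\pi(\text{blackmailed}), \text{blackmailed})
           && \text{(} P(\text{blackmailed}) = 1 \text{)}
\end{align*}
```

So under those two conditions the utility of the policy collapses to the utility of whatever act it takes when blackmailed, which is U(pay) or U(refuse).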
“a decision theory that submits to it is effectively a decision theory that allows its agent to be overwritten by the simplest liar there is”
The truth of this statement is very hard to analyze, since it is effectively a statement about the entire space of possible decision theories. Right now, I am not aware of any decision theory that can be made to overwrite itself completely just by promising it more utility or threatening it with less. Perhaps you can sketch one for me, but I can’t figure out how to make one without using an unbounded utility function, which, per the paper that I linked a few comments up, wouldn’t give a coherent decision agent using current techniques.
Anyway, I don’t really have a counter-intuition about what is going wrong with agents that give in to Pascal’s mugging. Everything gets incoherent very quickly, but I am utterly confused about what should be done instead.
That said, if an agent would take the mugger’s threat seriously under a naive decision theory and that disutility is more than the disutility of being exploitable by arbitrary muggers, decision-theoretic concerns do not make the latter disutility greater in any way. The point of UDT-like reasoning is that “what counterfactually would have happened if you decided differently” means more than just the naive causal interpretation would indicate. If you precommit to not paying a mugger, the mugger (who is familiar with your decision process) won’t go to the effort of mugging you for no gain. If you precommit to not finding shelter in a blizzard, the blizzard still kills you.
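The mugger/blizzard contrast can be sketched as a toy calculation. Every number below is a made-up assumption and the function names are hypothetical. The only thing the sketch is meant to show is that the probability of the threat depends on the policy in one case and not in the other.

```python
def p_mugged(pays_when_mugged):
    """Assumed model: a mugger who knows your decision process
    only bothers to mug you if you would pay."""
    return 1.0 if pays_when_mugged else 0.0


def p_blizzard(seeks_shelter):
    """The blizzard happens regardless of your policy."""
    return 1.0


def policy_eu(p_event, u_if_event, u_otherwise=0.0):
    """Expected utility of a policy, given how likely the event is under it."""
    return p_event * u_if_event + (1.0 - p_event) * u_otherwise


# Illustrative utilities only.
U_PAY, U_REFUSE_WHEN_MUGGED = -10.0, -50.0
U_SHELTER, U_NO_SHELTER = -1.0, -1000.0

mugger_pay = policy_eu(p_mugged(True), U_PAY)
mugger_refuse = policy_eu(p_mugged(False), U_REFUSE_WHEN_MUGGED)

blizzard_shelter = policy_eu(p_blizzard(True), U_SHELTER)
blizzard_no_shelter = policy_eu(p_blizzard(False), U_NO_SHELTER)
```

In this toy model, refusing beats paying against the mugger because the refusing policy is never mugged in the first place. Against the blizzard, the policy changes nothing about the probability, so the policy-level comparison is just the act-level one.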