Doing what you describe requires something like logical updatelessness, which UDT doesn’t do, and which we don’t know how to do in general. I think this was described in the post. Also, even if thinking more doesn’t allow someone to exploit you, it might cause you to miss a chance to exploit someone else, or to cooperate with someone else, because it makes you too hard to predict.
I don’t know what logical updatelessness means, and I don’t see where the article describes this, but I’ll just try to formalize what I describe, since you seem to imply that would be novel.
Let $\mathrm{UDT} := \arg\max_a \max_u \Box(\text{“}\mathrm{UDT} = a \implies \mathrm{utility} \geq u\text{”})$, i.e., for each action find the largest provable lower bound on utility, then take the action with the best such bound. Pitted against itself in modal combat, it’ll get at least the utility of (C,C), because “UDT = C ⇒ utility is that of (C,C)” is provable. In chicken, UDT is “swerve unless the opponent provably will”. UDT will swerve against itself (though that’s not provable), against the madman “never swerve”, and against “swerve unless the opponent swerves against a madman” (which requires UDT’s opponent to disentangle UDT from its environment). UDT won’t swerve against “swerve if the opponent doesn’t” or “swerve if it’s provable the opponent doesn’t”.
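To check those claims mechanically, here is a minimal sketch in Python. It leans on one standard way of making □ computable for mutually referential agents: evaluate provability on a finite Kripke chain, where □φ holds at world k iff φ held at every world below k (vacuously at world 0), and read off the limiting behavior once it stabilizes. The `run` harness and the agent names are mine, purely for illustration.

```python
S, D = "swerve", "dare"

def run(agent_a, agent_b, depth=10):
    """Evaluate two mutually referential chicken agents on a Kripke chain.

    At world k, an agent's `provably(pred)` oracle returns True iff pred
    held of the opponent's action at every world below k (vacuously True
    at world 0). The final entry is the stabilized, limiting play.
    """
    history = []  # history[k] = (A's action at world k, B's action at world k)
    for k in range(depth):
        def oracle(side, pred, k=k):
            return all(pred(world[side]) for world in history[:k])
        a = agent_a(lambda pred, k=k: oracle(1, pred, k))  # A reasons about B
        b = agent_b(lambda pred, k=k: oracle(0, pred, k))  # B reasons about A
        history.append((a, b))
    return history[-1]

def udt(provably):
    # "Swerve unless the opponent provably will."
    return D if provably(lambda act: act == S) else S

def madman(provably):
    return D  # never swerves

def proof_conditioner(provably):
    # "Swerve if it's provable the opponent doesn't."
    return S if provably(lambda act: act == D) else D

print(run(udt, udt))                # ('swerve', 'swerve')
print(run(udt, madman))             # ('swerve', 'dare')
print(run(udt, proof_conditioner))  # ('dare', 'swerve')
```

Note that “UDT = swerve” already fails at world 0 (where □ is vacuous and UDT dares), so it is never provable anywhere on the chain, which is the parenthetical “though that’s not provable” above.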
Attempting to exploit the opponent sounds to me like “self-modify into a madman if it’s provable that that will make the opponent swerve”, but that’s just UDT.
Suppose you (as a human) are playing chicken against this version of UDT, which has vastly more computing power than you and could simulate your decisions in its proofs. Would you swerve?
I wouldn’t, because I would reason that if I didn’t swerve, UDT would simulate that and conclude that not swerving leads to the highest utility. You said “By deliberately crashing into the formerly smart madman, UDT can retroactively erase the situation.” but I don’t see how this version of UDT does that.
I don’t know what logical updatelessness means, and I don’t see where the article describes this
You’re right, the post kind of just obliquely mentions it and assumes the reader already knows the concept, in this paragraph:
Both agents race to decide how to decide first. Each strives to understand the other agent’s behavior as a function of its own, to select the best policy for dealing with the other. Yet, such examination of the other needs to itself be done in an updateless way. It’s a race to make the most uninformed decision.
Not sure what’s a good reference for logical updatelessness. Maybe try some of these posts? The basic idea is just that even if you manage to prove that your opponent doesn’t swerve, you perhaps shouldn’t “update” on that and then make your own decision while assuming that as a fixed fact that can’t be changed.
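To make that concrete in the terms of your chicken sketch (reusing `run`, `udt`, `S` and `D` from it): an “updateful” agent is one that treats anything it proves about its opponent as a fixed fact and best-responds to it. On the chain semantics, that is exactly the kind of agent your UDT dares at:

```python
def updateful(provably):
    # Best-respond to whatever is provable about the opponent,
    # treating the proved fact as fixed and unchangeable.
    if provably(lambda act: act == D):
        return S  # "they provably won't swerve, so I'd better"
    if provably(lambda act: act == S):
        return D
    return D

print(run(udt, updateful))  # ('dare', 'swerve') -- the updater gets exploited
```

Being logically updateless would mean, roughly, refusing to let that first proof settle your decision, so that the opponent’s proof search can’t steer you into the swerving branch.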
If I didn’t assume PA is consistent, I would swerve because I wouldn’t know whether UDT might falsely prove that I swerve. Since PA is consistent and I assume this, I am in fact better at predicting UDT than UDT is at predicting itself, and it swerves while I don’t. Can you find a strategy that beats UDT, doesn’t disentangle its opponent from the environment, swerves against itself and “doesn’t assume UDT’s proof system is consistent”?
It sounds like you mentioned logical updatelessness because my version of UDT does not trust a proof of “u = …”; it wants the whole set of proofs of “u ≥ …”. I’m not yet convinced that there are any other proofs it must not trust.
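For concreteness, here is the selection rule I mean, written against an abstract oracle `provable(a, u)` standing in for □(“UDT = a ⟹ utility ≥ u”). The toy oracle in the usage lines is just a stand-in for real proof search:

```python
def udt_choice(actions, bounds, provable):
    def best_bound(a):
        # The best provable *lower bound* on utility for action a;
        # a lone proof of an exact value "u = ..." is never consulted.
        return max((u for u in bounds if provable(a, u)), default=float("-inf"))
    return max(actions, key=best_bound)

# Toy usage against a known madman: daring crashes, swerving doesn't.
payoff = {"swerve": 0, "dare": -10}
print(udt_choice(["swerve", "dare"], [-10, 0, 1],
                 lambda a, u: payoff[a] >= u))  # -> 'swerve'
```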