I have higher expectations from learning agents—that they learn to solve such problems despite the difficulties.
I’m saying that there’s probably a literal impossibility theorem lurking there.
But, after reading my comment above, my spouse Marcus correctly pointed out that I am mischaracterizing IBP. As opposed to IBRL, in IBP, pseudocausality is not quite the right fairness condition. In fact, in a straightforward operationalization of repeated full-box-dependent transparent Newcomb, an IBP agent would one-box. However, there are more complicated situations where it would deviate from full-fledged UDT.
Example 1: You choose whether to press button A or button B. After this, you play Newcomb. Omega fills the box iff you one-box both in the scenario in which you pressed button A and in the scenario in which you pressed button B. Randomization is not allowed. A UDT agent will one-box. An IBP agent might two-box because it considers the hypothetical in which it pressed a button different from what it actually intended to press to be “not really me” and therefore unpredictable. (Essentially, the policy is ill-defined off-policy.)
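To make the policy-level reasoning in Example 1 concrete, here is a minimal sketch that enumerates every deterministic policy and scores it under Omega's rule. The payoffs (1,000,000 in the big box, 1,000 in the small one) are the usual Newcomb numbers and are my assumption, as are all the names in the code. It only illustrates why an agent that optimizes over whole policies one-boxes; it does not attempt to model IBP.

```python
from itertools import product

# Assumed payoffs (standard Newcomb numbers, not specified in the example).
BIG, SMALL = 1_000_000, 1_000
ACTIONS = ("one-box", "two-box")


def box_filled(policy):
    """Omega's rule: fill iff the agent one-boxes after A *and* after B."""
    return policy["A"] == "one-box" and policy["B"] == "one-box"


def payoff(button, policy):
    """Payoff when the agent actually presses `button` and follows `policy`."""
    big = BIG if box_filled(policy) else 0
    return big if policy[button] == "one-box" else big + SMALL


def best_policies():
    """Return the best payoff and every (button, policy) pair achieving it."""
    scored = []
    for button, act_a, act_b in product("AB", ACTIONS, ACTIONS):
        policy = {"A": act_a, "B": act_b}
        scored.append((payoff(button, policy), button, policy))
    top = max(value for value, _, _ in scored)
    return top, [(b, p) for value, b, p in scored if value == top]


if __name__ == "__main__":
    top, winners = best_policies()
    print(top)
    for button, policy in winners:
        print(button, policy)
```

The point of scoring the whole table `policy` rather than a single action is that Omega's rule reads the branch the agent did not take, so the off-policy entry affects the payoff on-policy.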
Example 2: You see either a green light or a red light, and then choose between button A and button B. After this, you play Newcomb. Omega fills the box iff you either one-box after seeing green and pressing A or one-box after seeing green and pressing B. However, you always see red. A UDT agent will one-box if it saw the impossible green and two-box if it saw red. An IBP agent might two-box either way, because if it remembers seeing green then it decides that all of its assumptions about the world need to be revised.
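The same kind of enumeration can be sketched for Example 2. Here a policy is a table from (light, button) to an action, plus the button pressed in the real world, where the light is always red. The payoffs are again the usual Newcomb numbers (my assumption), and the names are hypothetical. The sketch only checks which policies are optimal when evaluated as a whole; it says nothing about how an IBP agent would update on seeing green.

```python
from itertools import product

# Assumed payoffs (standard Newcomb numbers, not specified in the example).
BIG, SMALL = 1_000_000, 1_000
ACTIONS = ("one-box", "two-box")
BRANCHES = [("green", "A"), ("green", "B"), ("red", "A"), ("red", "B")]


def box_filled(table):
    """Omega's rule: fill iff the agent one-boxes in some green branch."""
    return (table[("green", "A")] == "one-box"
            or table[("green", "B")] == "one-box")


def payoff(red_button, table):
    """Payoff in the real world, where the light is always red."""
    big = BIG if box_filled(table) else 0
    action = table[("red", red_button)]
    return big if action == "one-box" else big + SMALL


def optimal():
    """Return the best payoff and every (red_button, table) achieving it."""
    scored = []
    for red_button in "AB":
        for acts in product(ACTIONS, repeat=len(BRANCHES)):
            table = dict(zip(BRANCHES, acts))
            scored.append((payoff(red_button, table), red_button, table))
    top = max(value for value, _, _ in scored)
    return top, [(b, t) for value, b, t in scored if value == top]


if __name__ == "__main__":
    top, winners = optimal()
    print(top, len(winners))
```

Under these assumptions, every optimal table two-boxes in the red branch that is actually reached and one-boxes in at least one green branch, which matches the UDT behavior described above.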