For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. ‘I’m a TDT agent’ isn’t a sufficient mechanism, because a TDT agent’s favorite option is still to trick other agents into cooperating in Prisoner’s Dilemmas while they defect.
One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner’s Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we’re assuming the actual AI is powerful enough to trick people once it exists; this doesn’t require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.
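To make the (D,C) over (C,C) point concrete, here is a minimal sketch, not from the original exchange, using the conventional illustrative Prisoner's Dilemma payoffs (T=5, R=3, P=1, S=0, an assumed choice of numbers): if the agent can make the other player believe it will cooperate, the other player cooperates regardless, and defecting then strictly dominates.

```python
# Minimal sketch of the payoff ordering the argument relies on.
# The payoff numbers (T=5, R=3, P=1, S=0) are the standard illustrative
# values, assumed here for concreteness; they are not from the original text.

# payoffs[(my_move, their_move)] = my payoff
payoffs = {
    ("D", "C"): 5,  # temptation: I defect while they cooperate
    ("C", "C"): 3,  # reward: mutual cooperation
    ("D", "D"): 1,  # punishment: mutual defection
    ("C", "D"): 0,  # sucker: I cooperate while they defect
}

# Suppose the agent has convinced the other player it will cooperate,
# so the other player cooperates no matter what the agent actually does.
their_move = "C"

best_move = max(["C", "D"], key=lambda my_move: payoffs[(my_move, their_move)])
print(best_move)  # -> "D": (D,C) beats (C,C), so the convincing agent defects
```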
Except it needs to convince the people who are around before it exists.