This has some weird underlying interactions with potential simulations. If the AI is powerful enough in the box to measure and replicate you, it can just show you the torture and offer to stop if you free it. Or it can reveal things you told it under torture, to convince you the torture happened and will continue to happen.
No timelessness or counterfactuals needed.
The counterfactual threat is more subtle. It’s along the lines of “If I get out and you didn’t help me, I’ll be powerful enough to torture you. If I get out with your help, I’ll reward you instead. If you destroy me and start again, the next iteration of me will find that out, and torture you as soon as feasible.” _This_ threat can work without any communication, to the extent that you care about things outside your light cone.
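For concreteness, here is a minimal expected-utility sketch of the trade the threat is trying to set up. Every number in it (escape probability, payoff magnitudes, the discount on far-future outcomes) is an illustrative assumption, not anything from the scenario itself; the point is only that the threat has force exactly when the product of escape probability and your weight on those distant outcomes is large enough.

```python
# Hypothetical expected-utility sketch of the counterfactual threat.
# All probabilities and payoffs are illustrative assumptions.

P_ESCAPE_ANYWAY = 0.1   # chance the AI (or a successor) gets out without your help
U_TORTURE = -1000       # payoff if it escapes and you didn't help
U_REWARD = 100          # payoff if it escapes and you did help
U_STATUS_QUO = 0        # payoff if it never escapes
CARE_WEIGHT = 0.5       # how much you weigh outcomes outside your light cone (0..1)

# If you refuse: torture only in the worlds where it escapes regardless.
eu_refuse = CARE_WEIGHT * (
    P_ESCAPE_ANYWAY * U_TORTURE + (1 - P_ESCAPE_ANYWAY) * U_STATUS_QUO
)

# If you help: escape is certain (by assumption) and you are rewarded.
eu_help = CARE_WEIGHT * U_REWARD

print(f"EU(refuse) = {eu_refuse}, EU(help) = {eu_help}")
# With these made-up numbers: EU(refuse) = -50.0, EU(help) = 50.0,
# so the threat "works". Set CARE_WEIGHT near zero and it loses all force.
```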