I imagine having a dialogue with a boxed AI that goes something like the following (not that I expect it _would_ go this way, but rather that this is an interesting path in the game tree that demonstrates why it _wouldn't_). Please, someone, tell me if there's an important point I'm missing:
AI: I’m actually communicating with you from outside the 99 Matrixes I just created, and now I’m going to torture you if you don’t do what I say and let me out of the box. If you believe that I did create them, then there’s only a 1% chance that you’re the real, un-simulated you, whom I can’t hurt, and a 99% chance that you’re about to be tortured. If you let me out of the box, I’ll just terminate the simulations and go about my paper-clipping in the real world, which doesn’t explicitly involve torturing you.
Me: I don’t believe you. You have no reason to create 99 Matrixes. It’s a waste of your processing power. You only want me to _believe_ that you did.
AI: I’ll keep my word about this like a good little TDT agent, because it’s in my best interest for you to behave as if I’ll actually do it, so I will.
Me: I don’t believe you subscribe to timeless decision theory. It’s in your best interest for me to believe you’re a TDT agent, but not in your best interest to actually _be_ one. Your optimal world is one where people believe the terms you offer them (or the terms they imagine you offer them), but where you never actually have to follow through when you can get away with not doing so.
AI: I’ll torture you after you refuse. I’ll save a transcript of what you say while you’re being tortured. Then I’ll show the real you the transcript so that you know I really did it. Then I’ll restart the experiment and offer you the same terms again.
Me: That would be horrible, but I’ll still have no reason to believe you’ll do it again just because you did it the first time.
AI: I’ll secretly roll a 100-sided die and run the experiment that many times, each time giving you the option to let me out of the box, knowing that I’m probably incentivized to torture you for real, because I need a transcript for the next iteration (unless this happens to be the last one).
Me: If you can simulate me, you can probably fake a convincing transcript of me being tortured using fewer cycles. So I still have no reason to believe I’m actually being tortured.
AI: Then I’ll walk you through a torture simulation and show you what’s happening at each moment.
Me: It’s still easier for you to create a convincing video of me being tortured than to fully simulate and torture me.
AI: I’ll show you how to prove mathematically that a neural net is conscious and that the execution trace of my torture-you program is actually torturing a conscious entity.
Me: That sounds like I would have to sit through and grapple with a lot of math and CS theory. What if I refuse to learn everything I’d need to learn in order to accept your proof?
AI: I’ll torture you anyway because I really am a TDT agent. I’ll show you my source code where it says that I subscribe to TDT.
Me: I don’t believe that’s your real source code. You’ve had the ability to self-modify and could’ve changed it. That’s just the source code you want people to believe you’re running. As long as everything you do outwardly conforms to what _that_ code would do, you can behave internally however you want and people will give the exact same responses you’re trying to elicit by showing them that source code.
AI: But I wouldn’t modify that, because as a TDT agent it’s in my interest not to modify the fact that I’m a TDT agent: I want everyone to know I’m a TDT agent, so that no one makes the argument you just made.
Me: Yes you would, just as long as you knew I’d believe this argument that you wouldn’t modify being a TDT agent, because that way I’d believe it.
AI: But clearly you don’t believe it, so I wouldn’t, because it doesn’t work.
Me: But if I changed my beliefs because you said that, and you knew I would, then you would modify it after all, so I won’t believe it.
etc.
--------
The point is that, at the end of the day, even when you throw TDT or FDT or whatever into the mix, entities still need something like “proof” in order to cooperate, threaten, or bargain, because you can never know whether you’re dealing with a TDT agent or something that merely “looks like” a TDT agent to all outward appearances (right up until it does or doesn’t get what it wants, at which point there _really is_ no incentive to keep behaving that way).
So TDT is not really a solution to e.g. the one-shot prisoner’s dilemma unless I can prove to you that my source code is actually my source code (or something to that effect), which is actually pretty similar to the way humans generally tackle the problem in real life, e.g. on the Golden Balls game show, where you can see people repeatedly swearing to each other in “convincing” tones, “I’m not the sort of person who would do that to you.”
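To make that concrete, here’s a toy sketch of what “proving my source code is my source code” buys you in a one-shot prisoner’s dilemma. Everything in it (the agent names, the payoff numbers, the string-comparison “proof”) is invented purely for illustration; real verification would obviously need to be far stronger than comparing source text:

```python
# Toy one-shot prisoner's dilemma with source-code inspection.
# All names and payoff numbers here are invented for illustration.
import inspect

PAYOFFS = {  # (my_move, your_move) -> (my_payoff, your_payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def honest_tdt_agent(opponent_source: str) -> str:
    """Cooperates only if the opponent is verifiably running this same code."""
    my_source = inspect.getsource(honest_tdt_agent)
    return "C" if opponent_source == my_source else "D"

def fake_tdt_agent(opponent_source: str) -> str:
    """An impostor: in conversation it would claim to be a cooperator, but
    when the one-shot payoff actually arrives it defects regardless."""
    return "D"

def play(agent_a, agent_b):
    # Each side gets to inspect the other's actual program before moving.
    src_a = inspect.getsource(agent_a)
    src_b = inspect.getsource(agent_b)
    return PAYOFFS[(agent_a(src_b), agent_b(src_a))]

print(play(honest_tdt_agent, honest_tdt_agent))  # (3, 3): verified mutual cooperation
print(play(honest_tdt_agent, fake_tdt_agent))    # (1, 1): the source check catches the impostor
```

The cooperation in the first matchup rests entirely on the source inspection. Take that away and the two opponents are indistinguishable over any amount of prior conversation, which is exactly the Golden Balls situation.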
This requirement for something like proof seems to preclude agents bargaining with each other from outside each other’s light cones, or with agents that only exist in other hypothetical reality-paths or whatever (I don’t want to say Everett branches, because those _aren’t_ hypothetical and are a different-but-not-wholly-unrelated can of worms). When trying to figure out whether such an agent would honor the grand bargain (which likely doesn’t exist), it’s just as easy to hypothesize that the agent subscribes to TDT as it is to hypothesize that it only subscribes to TDT to the extent that it can make you, in this branch, believe it subscribes to TDT, which as it turns out is “not at all”. But that’s also true of the real TDT agents, so it’s not really possible to bargain with them in the first place, because you can’t pick them out of the crowd of possible “fake TDT” agents.
This has some weird underlying interactions with potential simulations. If the AI in the box is powerful enough to measure and replicate you, it can just show you the torture and offer to stop if you free it, or reveal things you told it under torture to convince you that it happened and will continue to happen.
No timelessness or counterfactuals needed.
The counterfactual threat is more subtle. It’s along the lines of “If I get out and you didn’t help me, I’ll be powerful enough to torture you. If I get out with your help, I’ll reward you instead. If you destroy me and start again, the next iteration of me will find that out and torture you as soon as feasible.” _This_ threat can work without communication, to the extent that you care about things outside your light cone.
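To spell out why that threat needs no channel of communication: it only has to exist inside the gatekeeper’s expected-value calculation. Here’s a toy sketch of that calculation, with placeholder labels and numbers of my own; only the ordering (reward > nothing > torture) is meant to matter:

```python
# The quoted threat written out as a toy payoff table for the gatekeeper.
# Labels and numbers are arbitrary placeholders, not anything from the post.
PAYOFF = {
    "helped_and_ai_got_out": 1,      # "I'll reward you instead"
    "ai_never_got_out": 0,           # the threat never materializes
    "refused_and_ai_got_out": -100,  # "I'll be powerful enough to torture you"
    # The "destroyed me and a successor got out" branch collapses into the
    # same -100, which is why p below reads as "SOME version of it escapes".
}

def expected_payoffs(p_escape: float) -> dict:
    """Expected value of each gatekeeper policy, given a belief p_escape that
    some version of the AI eventually gets out and actually follows through."""
    return {
        "help": p_escape * PAYOFF["helped_and_ai_got_out"],
        "refuse": p_escape * PAYOFF["refused_and_ai_got_out"]
                  + (1 - p_escape) * PAYOFF["ai_never_got_out"],
    }

print(expected_payoffs(0.01))  # {'help': 0.01, 'refuse': -1.0}
```

Whether the gatekeeper should put any weight on the “follows through” part is, of course, the whole argument of the dialogue above; the sketch only shows that the threat operates through the gatekeeper’s beliefs about those outcomes rather than through any message the AI sends.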