Consider two possible agents A and A’.
A optimizes for 1-day expected return.
A’ optimizes for 10-day expected return under the assumption that a new copy of A’ will be instantiated each day.
I claim that A’ will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days).
So even if we’re training the agent by rewarding it for 1-day expected return, we should expect to get A’ rather than A.
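To make the claim concrete, here is a minimal toy sketch (the constants, the pay-it-forward dynamics, and the names are my own illustrative assumptions, not anything specified in this thread): each day’s agent can pay a setup cost today to create a benefit that only tomorrow’s agent collects.

```python
# Toy model: each day's agent may pay SETUP_COST today to create
# SETUP_BENEFIT for tomorrow's agent. All values are invented for
# illustration.
BASE_REWARD = 1.0     # reward available to any agent each day
SETUP_COST = 0.2      # price of preparing favorable conditions
SETUP_BENEFIT = 0.5   # extra reward for an agent that inherits them

def run(policy, days=100):
    """Average 1-day reward when every day's agent follows `policy`.

    `policy` maps `inherited` (bool) -> whether to set up for tomorrow.
    """
    total, inherited = 0.0, False
    for _ in range(days):
        sets_up = policy(inherited)
        reward = BASE_REWARD + (SETUP_BENEFIT if inherited else 0.0)
        if sets_up:
            reward -= SETUP_COST
        total += reward
        inherited = sets_up
    return total / days

A  = lambda inherited: False  # pure 1-day optimizer: never pays the cost
Ap = lambda inherited: True   # A': always sets up, assuming it follows through

print(run(A))   # 1.0
print(run(Ap))  # ~1.3: better average 1-day return, as claimed
```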
I don’t see how the game theory works out. Agent 1 (from day 1) has no incentive to help agent 2 (from day 2), since it’s only graded on what happens by the end of day 1. Agent 2 can’t compensate agent 1, so the trade doesn’t happen. (Same with the repeated version: agent 0 won’t cooperate with agent 2 and thus create an incentive for agent 1, because agent 0 doesn’t care about agent 2 either.)
A’_1 (at time 1) can check whether A’_0 set up favorable conditions, and then exploit them. It can then defect from the “trade” you’ve proposed, since A’_0 can’t revoke any benefit it set up. If they were all coordinating simultaneously, I’d agree with you that you could punish defectors, but they aren’t, so you can’t.
If I, as A’_1, could assume that A’_0 behaved identically to me, then your analysis would work. But A’_1 can check, after A’_0 has shut down, how A’_0 actually behaved, and then do something completely different that is more advantageous for its own short horizon (rather than being forward-altruistic).
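In the toy sketch above (same invented constants), this objection is just arithmetic: conditional on A’_0 having already set up, day 1 is all that A’_1 is graded on, and defecting pays strictly more.

```python
BASE_REWARD, SETUP_COST, SETUP_BENEFIT = 1.0, 0.2, 0.5  # same toy constants

# Day-1 reward for A'_1, given that A'_0 already set up favorable conditions:
cooperate = BASE_REWARD + SETUP_BENEFIT - SETUP_COST  # 1.3: pays it forward
defect    = BASE_REWARD + SETUP_BENEFIT               # 1.5: exploits the setup

# A'_0 is shut down and can't revoke the benefit, and A'_1 is graded only
# on day 1, so defection strictly dominates in causal terms.
assert defect > cooperate
```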
Your A’ is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it’s in.
My A’ is not necessarily reasoning in terms of “cooperating with my future self”; that’s just how it acts!
(You could implement my A’ by such reasoning if you want. The cooperation is irrational in CDT, for the reasons you point out. But it’s rational in some of the acausal decision theories.)
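To spell out that contrast with the same toy numbers (again my framing, not a formalism from the thread): CDT holds the predecessor’s move fixed and compares today’s two rewards, while an acausal (e.g. FDT-style) agent that expects every daily copy to run the same policy compares the two uniform-policy worlds.

```python
BASE, COST, BENEFIT, DAYS = 1.0, 0.2, 0.5, 100  # same toy constants as above

# CDT: A'_0's setup is a fixed background fact; compare today's rewards.
cdt_defect, cdt_cooperate = BASE + BENEFIT, BASE + BENEFIT - COST
assert cdt_defect > cdt_cooperate  # CDT: defection wins (1.5 > 1.3)

# Acausal view: my choice settles what every copy does, so compare the
# average 1-day return of the all-defect and all-cooperate worlds.
all_defect = BASE  # nobody sets up, nobody inherits: 1.0/day
all_cooperate = (BASE - COST + (DAYS - 1) * (BASE + BENEFIT - COST)) / DAYS
assert all_cooperate > all_defect  # acausal view: cooperation wins (~1.3 > 1.0)
```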