Consider two possible agents A and A’.
A optimizes for 1-day expected return.
A’ optimizes for 10-day expected return under the assumption that a new copy of A’ will be instantiated each day.
I claim that A’ will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days).
So even if we’re training the agent by rewarding it for 1-day expected return, we should expect to get A’ rather than A.
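To make the claim concrete, here is a minimal toy sketch (the constants, the pay-it-forward dynamics, and the names are my own illustrative assumptions, not anything specified in this thread): each day’s agent can pay a setup cost today to create a benefit that only tomorrow’s agent collects.

```python
# Toy model: each day's agent may pay SETUP_COST today to create
# SETUP_BENEFIT for tomorrow's agent. All values are invented for
# illustration.
BASE_REWARD = 1.0     # reward available to any agent each day
SETUP_COST = 0.2      # price of preparing favorable conditions
SETUP_BENEFIT = 0.5   # extra reward for an agent that inherits them

def run(policy, days=100):
    """Average 1-day reward when every day's agent follows `policy`.

    `policy` maps `inherited` (bool) -> whether to set up for tomorrow.
    """
    total, inherited = 0.0, False
    for _ in range(days):
        sets_up = policy(inherited)
        reward = BASE_REWARD + (SETUP_BENEFIT if inherited else 0.0)
        if sets_up:
            reward -= SETUP_COST
        total += reward
        inherited = sets_up
    return total / days

A  = lambda inherited: False  # pure 1-day optimizer: never pays the cost
Ap = lambda inherited: True   # A': always sets up, assuming it follows through

print(run(A))   # 1.0
print(run(Ap))  # ~1.3: better average 1-day return, as claimed
```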
I don’t see how the game theory works out. Agent 1 (from day 1) has no incentive to help agent 2 (from day 2), since it’s only graded on what happens by the end of day 1. Agent 2 can’t compensate agent 1, so the trade doesn’t happen. (Same with the repeated version: agent 0 won’t cooperate with agent 2 and thus create an incentive for agent 1, because agent 0 doesn’t care about agent 2 either.)
A’_1 (at time 1) can check whether A’_0 set up favorable conditions, and then exploit them. It can then defect from the “trade” you’ve proposed, since A’_0 can’t revoke any benefit it set up. If they were all coordinating simultaneously, I’d agree with you that you could punish defectors, but they aren’t, so you can’t.
If I, as A’_1, could assume that A’_0 behaved identically to me, then your analysis would work. But A’_1 can check, after A’_0 has shut down, how A’_0 actually behaved, and then do something completely different that is more advantageous for its own short horizon (rather than being forward-altruistic).
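In the toy sketch above (same invented constants), this objection is just arithmetic: conditional on A’_0 having already set up, day 1 is all that A’_1 is graded on, and defecting pays strictly more.

```python
BASE_REWARD, SETUP_COST, SETUP_BENEFIT = 1.0, 0.2, 0.5  # same toy constants

# Day-1 reward for A'_1, given that A'_0 already set up favorable conditions:
cooperate = BASE_REWARD + SETUP_BENEFIT - SETUP_COST  # 1.3: pays it forward
defect    = BASE_REWARD + SETUP_BENEFIT               # 1.5: exploits the setup

# A'_0 is shut down and can't revoke the benefit, and A'_1 is graded only
# on day 1, so defection strictly dominates in causal terms.
assert defect > cooperate
```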
Your A’ is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it’s in.
My A’ is not necessarily reasoning in terms of “cooperating with my future self”; that’s just how it acts!
(You could implement my A’ by such reasoning if you want. The cooperation is irrational in CDT, for the reasons you point out. But it’s rational in some of the acausal decision theories.)
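To spell out that contrast with the same toy numbers (again my framing, not a formalism from the thread): CDT holds the predecessor’s move fixed and compares today’s two rewards, while an acausal (e.g. FDT-style) agent that expects every daily copy to run the same policy compares the two uniform-policy worlds.

```python
BASE, COST, BENEFIT, DAYS = 1.0, 0.2, 0.5, 100  # same toy constants as above

# CDT: A'_0's setup is a fixed background fact; compare today's rewards.
cdt_defect, cdt_cooperate = BASE + BENEFIT, BASE + BENEFIT - COST
assert cdt_defect > cdt_cooperate  # CDT: defection wins (1.5 > 1.3)

# Acausal view: my choice settles what every copy does, so compare the
# average 1-day return of the all-defect and all-cooperate worlds.
all_defect = BASE  # nobody sets up, nobody inherits: 1.0/day
all_cooperate = (BASE - COST + (DAYS - 1) * (BASE + BENEFIT - COST)) / DAYS
assert all_cooperate > all_defect  # acausal view: cooperation wins (~1.3 > 1.0)
```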