The person deploying the time-limited agent has a longer horizon. If they want their bank balance to keep growing, then presumably they will deploy a new copy of the agent tomorrow, and another copy the day after that. These time-limited agents have an incentive to coordinate with future versions of themselves: You’ll make more money today, if past-you set up the conditions for a profitable trade yesterday.
So a sequence of time-limited agents could still develop instrumental power-seeking. You could try to avert this by deploying a *different* agent each day, but then you miss out on the gains from intertemporal coordination, so the performance isn’t competitive with an unaligned benchmark.
Not really, due to the myopia of the situation. I think this may provide a better approach for reasoning about the behavior of myopic optimization.
I like the approach. Here is where I got applying it to our scenario:
m is a policy for day trading
L(m) is expected 1-day return
D(m) is the “trading environment” produced by m. Among other things it has to record your own positions, which include assets you acquired a long time ago. So in our scenario it has to depend not just on the policy we used yesterday but on the entire sequence of policies used in the past. The iteration becomes
m_{n+1} = argmax_m L(m; m_n, m_{n−1}, …).
In words, the new policy is the optimal policy in the environment produced by the entire sequence of old policies.
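To make the iteration concrete, here is a minimal sketch; the one-number "market" (where yesterday's setup work pays off double today) and the discrete policy grid are invented purely for illustration:

```python
# Toy model (invented for illustration): a policy m in [0, 1] is the
# fraction of the day spent setting up tomorrow's trades rather than
# harvesting returns today.
POLICIES = [i / 10 for i in range(11)]

def L(m, history):
    """Expected 1-day return of policy m in the environment produced by
    the past policies (history[0] is yesterday's policy). Yesterday's
    setup work pays off double today, and anyone can harvest it."""
    setup = history[0] if history else 0.0
    return (1 - m) + 2 * setup

def iterate(history, steps=10):
    """The iteration m_{n+1} = argmax_m L(m; m_n, m_{n-1}, ...)."""
    for _ in range(steps):
        best = max(POLICIES, key=lambda m: L(m, history))
        history = [best] + history
    return history

history = iterate([0.5])
print(history[0])  # 0.0: the myopic argmax never does setup work
```

In this toy the iteration hits a fixed point immediately: since L only pays out for today, the argmax picks the pure harvester no matter what the past policies were.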
Financial markets are far from equilibrium, so convergence to a fixed point is super unrealistic in this case. But okay, the fixed point is just a story to motivate the non-myopic loss L∗, so we could at least write it down and see if it makes sense?
L∗(x) = L(x; x, x, …) − max_m L(m; x, x, …)
So we’re optimizing for “How well x performs in an environment where it’s been trading forever, compared to how well the optimal policy performs in that environment”.
It’s kind of interesting that that popped out, because the kind of agent that performs well in an environment where it’s been trading forever, is one that sets up trades for its future self!
Optimizers of L∗ will behave as though they have a long time horizon, even though the original loss L was myopic.
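As a quick sanity check on that intuition, we can evaluate L∗ in a toy model (the one-parameter market below is invented for illustration): a policy x in [0, 1] is the fraction of effort spent setting up tomorrow's trades, and an environment where x has been trading forever carries a standing bonus of 2x that any policy can harvest.

```python
# Toy evaluation of L* (all modelling choices invented for illustration).
POLICIES = [i / 10 for i in range(11)]

def L(m, x):
    """1-day return of policy m in the environment where x has been
    trading forever, i.e. L(m; x, x, ...). The bonus 2*x left behind by
    x's setup work is harvestable by any policy, not just x."""
    return (1 - m) + 2 * x

def L_star(x):
    """L*(x) = L(x; x, x, ...) - max_m L(m; x, x, ...)."""
    return L(x, x) - max(L(m, x) for m in POLICIES)

print(L_star(0.5))  # long-term setter-upper: -0.5
print(L_star(0.0))  # myopic harvester: 0.0
```

In this particular toy the myopic policy actually scores best on L∗, because a deviator can harvest the long-term trader's setup just as well as the long-term trader can; whether that cancellation is generic is exactly what the rest of the thread debates.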
The initial part all looks correct. However, something got lost here:

It’s kind of interesting that that popped out, because the kind of agent that performs well in an environment where it’s been trading forever, is one that sets up trades for its future self!
It’s true that long-term trading will give a high L, but remember that under this view of myopia the agent is effectively optimizing L∗, and L∗ also subtracts off max_m L(m; x, x, …). This is a problem, because the long-term trader also increases the value of L for traders other than itself, probably by just as much as for itself, so the subtraction cancels the gain and no long time horizon emerges. As a result, a pure long-term trader will actually score low on L∗.
On the other hand, a modified version of the long-term trader which sets up “traps” that cause financial loss if it deviates from its strategy would not provide value to anyone who does not also follow its strategy, and therefore it would score high on L∗. There are almost certainly other agents that also score high on L∗ too, though.
Hmm, like what? I agree that the short-term trader s does a bit better than the long-term trader l in the l,l,… environment, because s can sacrifice the long term for immediate gain. But s does lousy in the s,s,… environment, so I think L^*(s) < L^*(l). It’s analogous to CC having higher payoff than DD in prisoner’s dilemma. (The prisoners being current and future self)
I like the traps example, it shows that L^* is pretty weird and we’d want to think carefully before using it in practice!
EDIT: Actually I’m not sure I follow the traps example. What’s an example of a trading strategy that “does not provide value to anyone who does not also follow its strategy”? Seems pretty hard to do! I mean, you can sell all your stock and then deliberately crash the stock market or something. Most strategies will suffer, but the strategy that shorted the market will beat you by a lot!
It’s true that L(s; s, s, …) is low, but you have to remember to subtract off max_m L(m; s, s, …). Since every trader will do badly in the environment generated by the short-term trader, the poor performance of the short-term trader in its own environment cancels out. Essentially, L∗ asks, “To what degree can someone exploit your environment better than you can?”.
If you’re limited to trading stocks, yeah, the traps example is probably very hard or impossible to pull off. What I had in mind is an AI with more options than that.
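For what it's worth, the trap idea is easy to state abstractly even if it's hard to implement with real trades. In this stylized sketch (invented for illustration), the value the trap policy stores up is usable only by a policy that continues the exact same strategy:

```python
# Toy "trap" variant (invented for illustration): policy x's setup work
# pays off only for a policy that continues x's exact strategy; any
# deviator gets nothing from it.
POLICIES = [i / 10 for i in range(11)]

def L_trap(m, x):
    """1-day return of m in the environment where trap-policy x has
    been trading forever."""
    bonus = 2 * x if m == x else 0.0
    return (1 - m) + bonus

def L_star(x):
    """L*(x) = L(x; x, x, ...) - max_m L(m; x, x, ...)."""
    return L_trap(x, x) - max(L_trap(m, x) for m in POLICIES)

print(L_star(0.5))  # trap policy: 0.0, no one exploits its environment better
print(L_star(0.0))  # plain myopic harvester also ties at 0.0
```

The trap policy can no longer be out-exploited in its own environment, so it achieves the maximal L∗ of zero; but so does the plain myopic harvester, matching the earlier point that other agents score high on L∗ too.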
I don’t see how the game theory works out. Agent 1 (from day 1) has no incentive to help agent 2 (from day 2), since it’s only graded on stuff that occurs by the end of day 1. Agent 2 can’t compensate agent 1, so the trade doesn’t happen. (Same with the repeated version—agent 0 won’t cooperate with agent 2 and thus create an incentive for agent 1, because agent 0 doesn’t care about agent 2 either.)
Consider two possible agents A and A’.
A optimizes for 1-day expected return.
A’ optimizes for 10-day expected return under the assumption that a new copy of A’ will be instantiated each day.
I claim that A’ will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days).
So even if we’re training the agent by rewarding it for 1-day expected return, we should expect to get A’ rather than A.
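That claim is easy to check in a toy setting (the market model below, where setup work sacrifices today's return but pays off double tomorrow, is invented purely for illustration):

```python
# Each day a fresh copy of the agent picks a setup fraction m in [0, 1];
# its 1-day return is (1 - m) + 2 * (yesterday's fraction).
def average_daily_return(m, days=100):
    """Average 1-day return when every daily copy plays the same m."""
    total, yesterday = 0.0, 0.0
    for _ in range(days):
        total += (1 - m) + 2 * yesterday
        yesterday = m
    return total / days

print(average_daily_return(0.0))  # A: pure 1-day optimizer, never sets up: 1.0
print(average_daily_return(1.0))  # A': full setup for its successors: 1.98
```

Each copy still only receives its own day's return; A’ wins on average because past copies behaved cooperatively. Note, though, that on any single day a copy would score even higher by harvesting its predecessor's setup without doing any setup of its own, which is exactly the defection worry raised in the replies below.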
A’_1 (at time 1) can check whether A’_0 set up favorable conditions, and then exploit them. It can then defect from the “trade” you’ve proposed, since A’_0 can’t revoke any benefit it set up. If they were all coordinating simultaneously, I’d agree with you that you could punish defectors, but they aren’t, so you can’t.
If I, as A’_1, could assume that A’_0 had identical behavior to me, then your analysis would work. But A’_1 can check, after A’_0 shuts down, how it behaved, and then do something completely different that is more advantageous for its own short horizon (rather than being forward-altruistic).
Your A’ is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it’s in.
My A’ is not necessarily reasoning in terms of “cooperating with my future self”, that’s just how it acts!
(You could implement my A’ by such reasoning if you want. The cooperation is irrational in CDT, for the reasons you point out. But it’s rational in some of the acausal decision theories.)