Not really, due to the myopia of the situation. I think this may provide a better approach for reasoning about the behavior of myopic optimization.
I like the approach. Here is how far I got applying it to our scenario:
$m$ is a policy for day trading.
$L(m)$ is the expected 1-day return.
$D(m)$ is the “trading environment” produced by $m$. Among other things it has to record your own positions, which include assets you acquired a long time ago. So in our scenario it has to depend not just on the policy we used yesterday but on the entire sequence of policies used in the past. The iteration becomes
$m_{n+1} = \operatorname{argmax}_m L(m;\, m_n, m_{n-1}, \ldots).$
In words, the new policy is the optimal policy in the environment produced by the entire sequence of old policies.
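Here is a minimal sketch of that iteration in code, under a deliberately toy model: the environment is collapsed into a single “carry” number standing in for the positions inherited from past policies, and a policy is just a scalar aggressiveness in [0, 1]. Every name and functional form below is invented for illustration, not part of the setup above.

```python
import numpy as np

# Toy model (all names and functional forms hypothetical): a policy is a
# scalar "aggressiveness" a in [0, 1]; the environment produced by the whole
# policy history is collapsed into a single carry term standing in for the
# positions inherited from past trading.

def carry(history, decay=0.5):
    """Environment state from past policies, newest first; old policies fade."""
    return sum(a * decay**k for k, a in enumerate(history))

def L(a, history):
    """L(m; m_n, m_{n-1}, ...): stylized expected 1-day return, concave in a.
    The carried positions shift where the one-day optimum sits."""
    return a * (1.0 + carry(history)) - a**2

GRID = np.linspace(0.0, 1.0, 101)

def best_response(history):
    """argmax_m L(m; history), by brute-force search over the policy grid."""
    return max(GRID, key=lambda a: L(a, history))

# The iteration m_{n+1} = argmax_m L(m; m_n, m_{n-1}, ...)
history = [0.0]
for _ in range(20):
    history.insert(0, best_response(history))
print([round(float(a), 2) for a in history[:5]])
```

In this toy model the iteration happens to settle down quickly; as noted just below, real markets needn't oblige.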
Financial markets are far from equilibrium, so convergence to a fixed point is super unrealistic in this case. But okay, the fixed point is just a story to motivate the non-myopic objective $L^*$, so we could at least write it down and see if it makes sense?
$L^*(x) = L(x;\, x, x, \ldots) - \max_m L(m;\, x, x, \ldots)$
So we’re optimizing for “how well $x$ performs in an environment where it’s been trading forever, compared to how well the optimal policy performs in that environment”.
It’s kind of interesting that that popped out, because the kind of agent that performs well in an environment where it’s been trading forever is one that sets up trades for its future self!
Optimizers of $L^*$ will behave as though they have a long time horizon, even though the original objective $L$ was myopic.
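To make $L^*$ concrete, the same toy model from the sketch above can be extended to evaluate it, again purely as an illustration: the infinite history $x, x, \ldots$ is truncated at a finite horizon, and the max over policies is a grid search.

```python
import numpy as np

# Same hypothetical toy model as in the sketch above.
def carry(history, decay=0.5):
    return sum(a * decay**k for k, a in enumerate(history))

def L(a, history):
    return a * (1.0 + carry(history)) - a**2

GRID = np.linspace(0.0, 1.0, 101)

def L_star(x, horizon=50):
    """L*(x) = L(x; x, x, ...) - max_m L(m; x, x, ...):
    how x does in the environment its own endless history produces,
    minus what the best exploiter of that same environment achieves."""
    history = [x] * horizon                     # truncate the infinite history
    return L(x, history) - max(L(a, history) for a in GRID)

for x in (0.2, 0.5, 1.0):
    print(f"L*({x}) = {L_star(x):+.4f}")        # never positive on the grid
```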
The initial part all looks correct. However, something got lost here:
It’s kind of interesting that that popped out, because the kind of agent that performs well in an environment where it’s been trading forever is one that sets up trades for its future self!
It’s true that long-term trading will give a high $L$, but remember that under myopia we would see the agent as optimizing $L^*$, and $L^*$ also subtracts off $\max_m L(m;\, x, x, \ldots)$. This is an issue, because the long-term trader will also increase the value of $L$ for traders other than itself, probably by just as much as it does for itself, so the subtraction cancels out its gains. As a result, a pure long-term trader will actually score low on $L^*$ and won’t end up with a long-term time horizon after all.
On the other hand, a modified version of the long-term trader which sets up “traps” that inflict financial losses on anyone who deviates from its strategy would not provide value to anyone who does not also follow its strategy, and therefore it would score high on $L^*$. There are almost certainly other agents that score high on $L^*$ too, though.
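A toy payoff table makes this subtraction argument easy to check. The numbers below are invented purely for illustration, with a pure long-term trader, a short-term exploiter, and the trap-setting variant:

```python
# Hypothetical payoffs: PAYOFF[(trader, env)] = L(trader; env, env, ...),
# with invented numbers chosen only to illustrate the argument above.
PAYOFF = {
    ("long", "long"): 10, ("long", "short"): 2, ("long", "trap"): 3,
    ("short", "long"): 12, ("short", "short"): 2, ("short", "trap"): 3,
    ("trap", "long"): 10, ("trap", "short"): 2, ("trap", "trap"): 8,
}
TRADERS = ("long", "short", "trap")

def L_star(x):
    """L*(x) = L(x; x, ...) - max_m L(m; x, ...)."""
    return PAYOFF[(x, x)] - max(PAYOFF[(m, x)] for m in TRADERS)

for x in TRADERS:
    print(f"L*({x}) = {L_star(x)}")
# long  -> 10 - 12 = -2: the healthy environment it builds is exploitable
# short ->  2 -  2 =  0: its environment is bad for everyone equally
# trap  ->  8 -  8 =  0: no one else can do better in its trapped environment
```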
the long-term trader will also increase the value of $L$ for traders other than itself, probably by just as much as it does for itself
Hmm, like what? I agree that the short-term trader $s$ does a bit better than the long-term trader $l$ in the $l, l, \ldots$ environment, because $s$ can sacrifice the long term for immediate gain. But $s$ does poorly in the $s, s, \ldots$ environment, so I think $L^*(s) < L^*(l)$. It’s analogous to CC having a higher payoff than DD in the Prisoner’s Dilemma. (The prisoners being current and future selves.)
I like the traps example; it shows that $L^*$ is pretty weird and we’d want to think carefully before using it in practice!
EDIT: Actually I’m not sure I follow the traps example. What’s an example of a trading strategy that “does not provide value to anyone who does not also follow its strategy”? Seems pretty hard to do! I mean, you can sell all your stock and then deliberately crash the stock market or something. Most strategies will suffer, but the strategy that shorted the market will beat you by a lot!
Hmm, like what? I agree that the short-term trader $s$ does a bit better than the long-term trader $l$ in the $l, l, \ldots$ environment, because $s$ can sacrifice the long term for immediate gain. But $s$ does poorly in the $s, s, \ldots$ environment, so I think $L^*(s) < L^*(l)$. It’s analogous to CC having a higher payoff than DD in the Prisoner’s Dilemma. (The prisoners being current and future selves.)
It’s true that $L(s;\, s, s, \ldots)$ is low, but you have to remember to subtract off $\max_m L(m;\, s, s, \ldots)$. Since every trader will do badly in the environment generated by the short-term trader, the poor performance of the short-term trader in its own environment cancels out. Essentially, $L^*$ asks, “To what degree can someone exploit your environment better than you can?”
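Spelling that reading out, and keeping the return-maximization convention used above, $-L^*$ is exactly the regret of $x$ in the environment its own history generates:

$L^*(x) = -\left[\max_m L(m;\, x, x, \ldots) - L(x;\, x, x, \ldots)\right]$

So maximizing $L^*$ means minimizing how much better anyone else could do in your self-generated environment, with no credit for how good that environment is in absolute terms, which is why the short-term trader’s uniformly bad environment costs it nothing.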
I like the traps example; it shows that $L^*$ is pretty weird and we’d want to think carefully before using it in practice!
EDIT: Actually I’m not sure I follow the traps example. What’s an example of a trading strategy that “does not provide value to anyone who does not also follow its strategy”? Seems pretty hard to do! I mean, you can sell all your stock and then deliberately crash the stock market or something. Most strategies will suffer, but the strategy that shorted the market will beat you by a lot!
If you’re limited to trading stocks, yeah, the traps example is probably very hard or impossible to pull off. What I had in mind is an AI with more options than that.