Nice post! I think this notion of time-inconsistency points to a key problem in impact measurement, and if we could solve it (without backtracking on other problems, like interference/offsetting), we would be a lot closer to dealing with subagent issues.
I think the other baselines can also induce time-inconsistent behavior, for the same reason: if reaching the main goal has a side effect of allowing the agent to better achieve the auxiliary goal (compared to starting state / inaction / stepwise inaction), the agent is willing to pay a small amount to restrict its later capabilities. Sometimes this is even a good thing—the agent might “pay” by increasing its power in a very specialized and narrow manner, instead of gaining power in general, and we want that.
Here are some technical quibbles which don’t affect the conclusion (yay).
> If using an inaction rollout of length l, just multiply that penalty by γ^l
I don’t think so—the inaction rollout formulation (as I think of it) compares the optimal value after taking action a and waiting for N−1 steps, with the optimal value after N steps of waiting. There’s no additional discount there.
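To make the comparison concrete, here is a toy sketch (my own illustration, not the post's formalism) of that inaction-rollout penalty: the optimal auxiliary value after taking action a and then waiting N−1 steps, versus after N steps of waiting, with no extra discount factor. The environment, `step`, and `aux_value` are all hypothetical stand-ins.

```python
# Hypothetical deterministic environment on integer states; action 0 is the no-op.
def step(state, action):
    return state + action

def aux_value(state):
    # Stand-in for the optimal auxiliary value V* at a state.
    return float(state)

def rollout_penalty(state, action, N):
    # Branch 1: take the action, then wait N-1 steps.
    s_act = step(state, action)
    for _ in range(N - 1):
        s_act = step(s_act, 0)
    # Branch 2: wait N steps.
    s_inact = state
    for _ in range(N):
        s_inact = step(s_inact, 0)
    # Compare the two rollouts directly -- no additional discount applied.
    return abs(aux_value(s_act) - aux_value(s_inact))

print(rollout_penalty(0, 3, 5))  # -> 3.0
```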
Fortunately, when summing up the penalties, you sum terms like … + p|γ^{n−1} − γ^n| + p|γ^n − γ^{n+1}| + …, so a lot of the terms cancel.
> Why do the absolute values cancel?
Because γ^n > γ^{n+1} for 0 < γ < 1, so you can remove the absolute values, and the sum telescopes.
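A quick numerical check of that cancellation (the values of γ, p, and the summation range are arbitrary choices for illustration): once the absolute values are dropped, the sum of p(γ^{n−1} − γ^n) telescopes down to p times the difference of the endpoint powers.

```python
# Check that dropping the absolute values is valid and the sum telescopes.
gamma, p = 0.9, 2.0

# Each term p * |gamma^(n-1) - gamma^n| for n = 1..10.
terms = [p * abs(gamma ** (n - 1) - gamma ** n) for n in range(1, 11)]
total = sum(terms)

# Telescoping closed form: p * (gamma^0 - gamma^10).
closed = p * (1 - gamma ** 10)
print(abs(total - closed) < 1e-12)  # -> True
```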