An idea about instrumental convergence for non-equilibrium RL algorithms.
There are clearly many instrumentally convergent subgoals in our universe, like controlling large amounts of wealth, social capital, energy, or matter. I claim the states of the universe satisfying such subgoals are heavy-tailed: a small number of states confer far more optionality than the rest. If we simplify the universe as a simple MDP in which such subgoal-satisfying states are the states with high out-degree (many exiting transitions), then a reasonable model for such an MDP is to assume the out-degrees are power-law distributed, and thus heavy-tailed.
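To make "power-law out-degrees" concrete, here is a minimal sketch of the kind of toy MDP I have in mind; the names and parameters (`random_powerlaw_mdp`, `exponent`, `max_out`) are illustrative, not the notebook's code:

```python
import numpy as np

def random_powerlaw_mdp(n_states=500, exponent=2.0, max_out=50, seed=0):
    """Toy deterministic MDP as a directed graph: each state's out-degree
    (its number of available actions/exits) is drawn from a power law, so a
    few 'power' states have many more exits than the rest."""
    rng = np.random.default_rng(seed)
    # Zipf draws are power-law distributed; cap them so out-degree stays small
    # relative to the number of states.
    out_degrees = np.minimum(rng.zipf(exponent, size=n_states), max_out)
    # Each state's actions lead to distinct, uniformly random successor states.
    successors = [rng.choice(n_states, size=d, replace=False)
                  for d in out_degrees]
    return out_degrees, successors
```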
If we run an asynchronous dynamic programming algorithm on such an MDP, then it seems likely that there exists a threshold exponent for that power law (perhaps we also need terms for the in-degree distribution of the nodes) such that, for all exponents greater than that, the algorithm will find and keep a power-seeking policy before arriving at the optimal policy.
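A minimal sketch of what "asynchronous dynamic program" means here, built on the toy MDP above; it assumes a single rewarding goal node and in-place, random-order value backups, and tracks the correlation between the intermediate value estimates and out-degree as a crude power-seeking measure (all of this is illustrative, not the notebook's code):

```python
import numpy as np

def async_value_iteration(out_degrees, successors, goal, gamma=0.95,
                          sweeps=30, seed=0):
    """Asynchronous (in-place, random-order) value iteration, recording after
    each sweep how correlated the value estimates are with out-degree."""
    rng = np.random.default_rng(seed)
    n = len(out_degrees)
    V = np.zeros(n)
    reward = np.zeros(n)
    reward[goal] = 1.0                      # single rewarding node
    corrs = []
    for _ in range(sweeps):
        for s in rng.permutation(n):        # asynchronous: updates use the
            V[s] = max(reward[s2] + gamma * V[s2]   # latest in-place values
                       for s2 in successors[s])
        # Power-seeking proxy: correlation of current values with out-degree.
        corrs.append(np.corrcoef(V, out_degrees)[0, 1] if V.std() > 0 else 0.0)
    return V, corrs
```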
A simple experiment I did this morning: github notebook. It does indeed seem like we often get more power-seeking (measured by the correlation between a state's value and its degree) than is optimal before we reach the equilibrium policy. Here is one plot, for 5 samples of policy iteration; you can see the details by examining the code:
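For readers who don't want to open the notebook, a rough check in the same spirit, built on the two sketches above (so inheriting their illustrative assumptions, and using asynchronous value iteration rather than the notebook's policy iteration), would be:

```python
# Does the value/out-degree correlation overshoot its final level before the
# values converge, i.e. is the intermediate policy transiently more
# power-seeking than the (near-)optimal one?
out_deg, succ = random_powerlaw_mdp(seed=1)
V, corrs = async_value_iteration(out_deg, succ, goal=0, seed=1)
print("correlation at convergence:", corrs[-1])
print("max intermediate correlation:", max(corrs))
```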
Another way this could turn out: if in-degree is anti-correlated with out-degree, the benefit of power-seeking may be washed out by power being hard to reach (a construction with this property is sketched below), so we should expect worse-than-optimal policies that are maybe more, maybe less power-seeking than the optimal policy, depending on the particulars of the environment. The next question is: which particulars? Perhaps the extent of the decorrelation, or maybe varying the ratio of the two exponents is a better idea. Perhaps size becomes a factor: in sufficiently large environments, figuring out how to access one of many power nodes may become easier on average than figuring out how to access the single goal node. The number and relatedness of rewarding nodes also seem relevant. If there are very few, then finding a power node is probably easier than finding a reward node. If there are very many, and/or they each lead into each other, then your chances of finding a reward node increase, and once you find one, your chances of finding more increase, so power is not so necessary.
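A sketch of the anti-correlated variant, to make that first branch concrete; here "hard to reach" is modelled by making a state's chance of being chosen as a successor inversely proportional to its out-degree, which is just one illustrative way to get the anti-correlation:

```python
import numpy as np

def anticorrelated_mdp(n_states=500, exponent=2.0, max_out=50, seed=0):
    """Variant of the toy MDP where high-out-degree ('power') states are
    correspondingly hard to enter, so in- and out-degree are anti-correlated."""
    rng = np.random.default_rng(seed)
    out_degrees = np.minimum(rng.zipf(exponent, size=n_states), max_out)
    # States with many exits are proportionally less likely to be entered.
    weights = 1.0 / out_degrees
    p = weights / weights.sum()
    successors = [rng.choice(n_states, size=d, replace=False, p=p)
                  for d in out_degrees]
    return out_degrees, successors
```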