A simple experiment I did this morning: github notebook. It does indeed seem like we often get more power-seeking (measured by the correlation between the value and degree) than is optimal before we get to the equilibrium policy. This is one plot, for 5 samples of policy iteration. You can see details by examining the code:
A simple experiment I did this morning: github notebook. It does indeed seem like we often get more power-seeking (measured by the correlation between the value and degree) than is optimal before we get to the equilibrium policy. This is one plot, for 5 samples of policy iteration. You can see details by examining the code: