It seems a common reading of my results is that agents tend to seek out states with higher power. I think this is usually right, but it’s false in some cases. Here’s an excerpt from the paper:
So just because a state has more resources doesn’t technically mean the agent will go out of its way to reach it. Here’s what the relevant current results say: parts of the future allowing you to reach more terminal states are instrumentally convergent, and the formal POWER contributions of different possibilities are approximately proportionally related to their instrumental convergence. As I said in the paper,
The formalization of power seems reasonable, consistent with intuitions for all toy MDPs examined. The formalization of instrumental convergence also seems correct. Practically, if we want to determine whether an agent might gain power in the real world, one might be wary of concluding that we can simply “imagine” a relevant MDP and then estimate e.g. the “power contributions” of certain courses of action. However, any formal calculations of POWER are obviously infeasible for nontrivial environments.
To make predictions using these results, we must combine the intuitive correctness of the power and instrumental convergence formalisms with empirical evidence (from toy models), with intuition (from working with the formal object), and with theorems (like theorem 46, which reaffirms the common-sense prediction that more cycles means asymptotic instrumental convergence, or theorem 26, fully determining the power in time-uniform environments). We can reason, “for avoiding shutdown to not be heavily convergent, the model would have to look like such-and-such, but it almost certainly does not...”.
I think the Tic-Tac-Toe reasoning is a better intuition: it’s instrumentally convergent to reach parts of the future which give you more control from your current vantage point. I’m working on expanding the formal results to include some version of this.
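As a rough illustration of the claim above that parts of the future with more reachable terminal states are instrumentally convergent, here is a minimal Monte Carlo sketch on a toy setup. It is not the paper's POWER calculation. It assumes a start state with two branches, a reward drawn independently and uniformly from [0, 1] for each terminal state, and an agent that only cares about the terminal state it ends up in (roughly, a discount rate close to 1). The function name and parameters are hypothetical, chosen for this example.

```python
import random


def fraction_preferring_branch(n_left, n_right, n_samples=100_000, seed=0):
    """Estimate the fraction of sampled reward functions for which the
    optimal agent heads into the right branch.

    Assumptions (illustrative only): each terminal state gets an i.i.d.
    Uniform[0, 1] reward, and the agent simply picks the branch that
    contains the single best terminal state.
    """
    rng = random.Random(seed)
    right_wins = 0
    for _ in range(n_samples):
        # Best achievable terminal reward in each branch for this sample.
        best_left = max(rng.random() for _ in range(n_left))
        best_right = max(rng.random() for _ in range(n_right))
        if best_right > best_left:
            right_wins += 1
    return right_wins / n_samples


if __name__ == "__main__":
    # One terminal state on the left, a growing number on the right.
    for n_right in (1, 2, 3, 10):
        frac = fraction_preferring_branch(1, n_right)
        print(f"right branch has {n_right} terminal states: "
              f"optimal for ~{frac:.3f} of sampled reward functions")
```

Under these assumptions the exact fraction is `n_right / (n_left + n_right)` by symmetry, since each terminal state is equally likely to be the best one. So the branch with more terminal states is optimal for most reward functions, even though no individual state in it is special. Toy checks like this are the kind of "empirical evidence (from toy models)" the excerpt mentions, and they do not replace the formal results.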