Can you say more about Alex Turner’s formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid concluding that the twitching robot seeks convergent instrumental goals?
Sure, I can say more about Alex Turner’s formalism! The theorems show that, with respect to some distribution of reward functions and in the limit of farsightedness (as the discount rate goes to 1), the optimal policies under this distribution tend to steer towards parts of the future which give the agent access to more terminal states.
Of course, there exist reward functions for which twitching or doing nothing is optimal. The theorems say that most reward functions aren’t like this.
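To make the "most reward functions" claim concrete, here is a minimal sketch on a toy environment that is not taken from the paper. It assumes a hypothetical deterministic MDP in which the first move leads either "left" to a single absorbing terminal state or "right" to a hub with three absorbing terminal states. It also assumes rewards are drawn i.i.d. uniform on [0, 1], and that a sufficiently far-sighted optimal agent simply heads for the terminal state with the highest reward. All names (`TERMINALS`, `optimal_first_move`, `fraction_choosing`) are illustrative.

```python
import random

# Hypothetical toy environment (not from the paper): from the start state the
# agent moves either "left", reaching 1 absorbing terminal state, or "right",
# reaching a hub from which it can pick any of 3 absorbing terminal states.
TERMINALS = {"left": ["L0"], "right": ["R0", "R1", "R2"]}


def optimal_first_move(reward):
    """Return the first move of a far-sighted optimal agent.

    Assumption: as the discount approaches 1, the reward collected forever in
    the chosen terminal state dominates, so the agent heads for the terminal
    state with the highest reward.
    """
    best_state = max(reward, key=reward.get)
    for move, states in TERMINALS.items():
        if best_state in states:
            return move
    raise ValueError("unknown terminal state")


def fraction_choosing(move, n_samples=100_000, seed=0):
    """Estimate how often `move` is optimal under i.i.d. uniform rewards."""
    rng = random.Random(seed)
    all_states = [s for states in TERMINALS.values() for s in states]
    hits = 0
    for _ in range(n_samples):
        reward = {s: rng.random() for s in all_states}
        hits += optimal_first_move(reward) == move
    return hits / n_samples


if __name__ == "__main__":
    print("right:", fraction_choosing("right"))
    print("left:", fraction_choosing("left"))
```

Under these assumptions each of the four terminal states is equally likely to carry the highest reward, so "right" should be optimal for roughly three quarters of sampled reward functions. A "twitching"-style reward function, such as one that puts all reward on `L0`, still makes "left" optimal; it is simply in the minority.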
I encourage you to read the post and/or paper; it’s quite different from the one you cited in that it shows how instrumental convergence and power-seeking arise from first principles. Rather than assuming “resources” exist, whatever that means, resource acquisition is explained as a special case of power-seeking.
ETA: Also, my recently completed sequence focuses on formally explaining and deeply understanding why catastrophic behavior seems to be incentivized. In particular, see The Catastrophic Convergence Conjecture.
I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don’t exist, because Alex’s paper doesn’t take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks “power” (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can’t make use of additional resources, but by saying that the twitching robot has an atypical reward function. So even though there aren’t conditions similar to those in Theorem 2, there are still conditions analogous to them (in the structure of the argument “expected utility/reward maximization + X implies catastrophe”), namely X = “the reward function is typical”. Does that sound right?
Writing this comment reminded me of Oliver’s comment where X = “agent wasn’t specifically optimized away from goal-directedness”.
> because Alex’s paper doesn’t take an arbitrary utility function and prove instrumental convergence;
That’s right; that would prove too much.
> namely X = “the reward function is typical”. Does that sound right?
Yeah, although note that I proved asymptotic instrumental convergence for typical reward functions under the assumption that reward is sampled i.i.d. at each state, so I think there’s wiggle room to say “but the reward functions we provide aren’t drawn from this distribution!”. I personally think this doesn’t matter much, because the work still tells us a lot about the underlying optimization pressures.
The result is also true in the general case of an arbitrary reward function distribution; you just don’t know in advance which terminal states the distribution prefers.
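As a rough illustration of how the favored terminal states depend on the distribution, here is a small sketch on a hypothetical toy setup that is not from the paper. One branch ("left") reaches a single terminal state and the other ("right") reaches three. The `skew` parameter is an invented knob: the left state's reward is the maximum of `skew` uniform draws, so `skew=1` is the i.i.d. case and larger values tilt the distribution toward the left state. This only shows that the preferred branch changes with the distribution; it does not demonstrate the general theorem.

```python
import random

# Hypothetical toy setup: "left" reaches 1 terminal state, "right" reaches 3.
N_RIGHT = 3


def left_fraction(skew, n_samples=100_000, seed=0):
    """Estimate how often the lone left terminal state has the highest reward.

    The left reward is the max of `skew` uniform draws, so skew=1 recovers
    the i.i.d. case and larger values tilt the distribution toward "left".
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        left = max(rng.random() for _ in range(skew))
        right = max(rng.random() for _ in range(N_RIGHT))
        hits += left > right
    return hits / n_samples


if __name__ == "__main__":
    for skew in (1, 3, 9):
        print(skew, left_fraction(skew))
```

With `skew=1` the left state should win about a quarter of the time, and with `skew=9` about three quarters of the time (9 of the 12 underlying uniform draws belong to it), so which branch a far-sighted optimal agent tends toward is a property of the distribution rather than of the environment alone.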