One additional source that I found helpful to look at is the paper “Formalizing Convergent Instrumental Goals” by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro’s instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.
The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility (compared to the utility it could achieve if it didn’t make use of the additional resources). Therefore, any optimal strategy will make use of those additional resources (killing humans in the process). In the Bit Universe example given in the paper, if the agent doesn’t terminally care what happens in some particular region h (I guess they chose this letter because it’s supposed to represent where humans are), but h contains resources that can be burned to increase utility in other regions, the agent will burn those resources.
Both Rohin’s and Jessica’s twitching robot examples seem to violate these assumptions (if we were to translate them into the formalism used in the paper), because the robot cannot make use of additional resources to obtain a higher utility.
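A minimal sketch of how I'm reading those assumptions (hypothetical numbers and function names, mine rather than the paper's formalism): if each unit of resources taken from region h strictly increases the best achievable utility, then the resource-taking strategy beats every alternative; a twitching-robot utility, by contrast, is flat in resources, so the strict-improvement assumption fails and nothing forces the optimal strategy to touch h.

```python
def resource_hungry_utility(taken_from_h):
    # Theorem 2-style assumption (as I read it): achievable utility is
    # strictly increasing in the resources taken from region h.
    return 10 + taken_from_h

def twitching_utility(taken_from_h):
    # Twitching robot: additional resources buy no extra utility,
    # so the strict-improvement assumption is violated.
    return 10

for utility in (resource_hungry_utility, twitching_utility):
    # Candidate strategies: take 0 or 5 units of resources from region h.
    best_take = max([0, 5], key=utility)
    print(utility.__name__, best_take)
# resource_hungry_utility takes 5 (strips h); for twitching_utility,
# taking 0 is already optimal (max breaks the tie toward 0).
```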
For me, the upshot of looking at this paper is something like:
MIRI people don’t seem to be arguing that expected utility maximization alone implies catastrophe.
There are some additional conditions that, when taken together with expected utility maximization, seem to give a pretty good argument for catastrophe.
These additional conditions don’t seem to have been argued for (or at least, this specific paper just assumes them).
See also Alex Turner’s work on formalizing instrumentally convergent goals, and his walkthrough of the MIRI paper.
Can you say more about Alex Turner’s formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the conclusion that the twitching robot seeks convergent instrumental goals?
Sure, I can say more about Alex Turner’s formalism! The theorems show that, with respect to some distribution of reward functions and in the limit of farsightedness (as the discount factor goes to 1), the optimal policies under this distribution tend to steer towards parts of the future which give the agent access to more terminal states.
Of course, there exist reward functions for which twitching or doing nothing is optimal. The theorems say that most reward functions aren’t like this.
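To illustrate the flavor of this claim with a made-up toy MDP (mine, not from the paper): suppose action "left" from the start state reaches a single terminal state t0, while "right" reaches any of t1–t3. Sampling each terminal state's reward i.i.d. uniformly, the option-rich "right" branch is optimal for about three-quarters of reward functions; twitching-style reward functions, where t0 happens to draw the highest reward, are the atypical quarter.

```python
import random

random.seed(0)

N = 100_000
prefers_option_rich = 0
for _ in range(N):
    # i.i.d. uniform reward for terminal states t0, t1, t2, t3
    r = [random.random() for _ in range(4)]
    if max(r[1:]) > r[0]:  # best state reachable via "right" beats t0
        prefers_option_rich += 1

print(prefers_option_rich / N)  # ≈ 0.75: most rewards favor more options
```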
I encourage you to read the post and/or paper; it’s quite different from the one you cited in that it shows how instrumental convergence and power-seeking arise from first principles. Rather than assuming “resources” exist, whatever that means, resource acquisition is explained as a special case of power-seeking.
ETA: Also, my recently completed sequence focuses on formally explaining and deeply understanding why catastrophic behavior seems to be incentivized. In particular, see The Catastrophic Convergence Conjecture.
I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don’t exist, because Alex’s paper doesn’t take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks “power” (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot counterexample not by saying that the robot can’t make use of additional resources, but by saying that the twitching robot has an atypical reward function. So even though there aren’t conditions similar to those in Theorem 2, there are still conditions analogous to them (in the structure of the argument “expected utility/reward maximization + X implies catastrophe”), namely X = “the reward function is typical”. Does that sound right?
Writing this comment reminded me of Oliver’s comment where X = “agent wasn’t specifically optimized away from goal-directedness”.
because Alex’s paper doesn’t take an arbitrary utility function and prove instrumental convergence;
That’s right; that would prove too much.
namely X = “the reward function is typical”. Does that sound right?
Yeah, although note that I proved asymptotic instrumental convergence for typical reward functions under i.i.d. reward sampling assumptions at each state, so I think there’s wiggle room to say “but the reward functions we provide aren’t drawn from this distribution!”. I personally think this doesn’t matter much, because the work still tells us a lot about the underlying optimization pressures.
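As a small check of the i.i.d. case in a toy setting (my own hypothetical four-terminal-state example, not from the paper): with one branch reaching only terminal state t0 and another reaching t1–t3, the fraction of sampled reward functions favoring the option-rich branch is 3/4 no matter which continuous distribution the i.i.d. per-state rewards are drawn from, since only the ranking of the draws matters.

```python
import random

random.seed(0)

def frac_option_rich(sample, n=100_000):
    """Fraction of sampled reward functions for which the branch reaching
    terminal states t1-t3 beats the branch reaching only t0."""
    wins = 0
    for _ in range(n):
        r = [sample() for _ in range(4)]  # one i.i.d. draw per terminal state
        if max(r[1:]) > r[0]:
            wins += 1
    return wins / n

print(frac_option_rich(random.random))                  # ≈ 0.75, uniform rewards
print(frac_option_rich(lambda: random.expovariate(1)))  # ≈ 0.75, skewed rewards too
```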
The result is also true in the general case of an arbitrary reward function distribution, you just don’t know in advance which terminal states the distribution prefers.
Yeah, that upshot sounds pretty reasonable to me. (Though idk if it’s reasonable to think of that as endorsed by “all of MIRI”.)
Note that this requires the utility function to be completely indifferent to humans (or actively against them).