I think there’s an important thing to note, if it doesn’t already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.
Here’s my attempt at a poor intuitionistic proof:
If you have some kind of program that understands consequences, backchains, or the like, then it’s plausibly capable of recognizing that “acquire lots of power” will let it choose from a much larger set of possibilities. Regardless of the details of how “decisions” are made, it seems easy for the choice to land on one of the massive array of outcomes that become available once you have control of the light-cone, i.e. once you have acquired power. And thus I’m worried about “instrumental convergence”.
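(To make that intuition concrete, here is a minimal toy sketch, entirely my own construction rather than anyone’s actual agent design: a tiny backward-chaining search over a hand-made state graph, in which a generic “acquire_resources” state reaches the most goal states, so plans for most randomly chosen goals route through it.)

```python
# Toy sketch, my own construction: backward chaining over a hand-made state
# graph. "acquire_resources" reaches the most goal states, so plans for most
# randomly chosen goals route through it: a cartoon of the convergence intuition.
from collections import deque
import random

# edges[state] = states directly reachable from `state` by some action
edges = {
    "start":             ["acquire_resources", "goal_1"],
    "acquire_resources": ["goal_1", "goal_2", "goal_3", "goal_4", "goal_5"],
    "goal_1": [], "goal_2": [], "goal_3": [], "goal_4": [], "goal_5": [],
}

def backchain(goal, start="start"):
    """Breadth-first search backwards from `goal` until `start` is reached."""
    parents = {s: [p for p in edges if s in edges[p]] for s in edges}
    frontier, seen = deque([[goal]]), {goal}
    while frontier:
        path = frontier.popleft()
        if path[-1] == start:
            return list(reversed(path))          # path from start to goal
        for p in parents[path[-1]]:
            if p not in seen:
                seen.add(p)
                frontier.append(path + [p])

random.seed(0)
goals = [f"goal_{random.randint(1, 5)}" for _ in range(10)]
routed = sum("acquire_resources" in backchain(g) for g in goals)
print(f"{routed}/10 random goals were planned through 'acquire_resources'")
```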
---
At this point, I’m already much more worried about instrumental convergence, because backchaining feels damn useful. It’s the sort of thing I’d expect most competent mind-like programs to be using in some form somewhere. It certainly seems more plausible to me that a random mind does backchaining than that a random mind looks like “utility function over here” and “maximizer over there”.
(For instance, even setting aside how AI researchers are literally building backchaining/planning into RL agents, one might expect most powerful reinforcement learners to benefit a lot from being able to reason in a consequentialist way about actions. If you can’t literally solve your domain with a lookup table, then causality and counterfactuals let you learn more from data, and better optimize your reward signal.)
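(And to make the lookup-table point concrete with a deliberately silly toy example of my own, not anything from a real RL system: an agent with even a crude model of consequences can score an action it has never tried, while a pure table of observed returns cannot.)

```python
# Toy example, my own: a 1-D world where state 5 gives reward. The lookup-table
# agent only knows returns for (state, action) pairs it has actually observed;
# the model-based agent can evaluate both actions from state 4 by simulating
# their consequences, even though it has never taken one of them.
reward = {s: (10 if s == 5 else 0) for s in range(10)}

def transition(state, action):       # action is -1 or +1
    return state + action

# Lookup-table agent: has only ever moved right from state 3, so it has no
# basis for comparing the two actions available in state 4.
observed_returns = {(3, +1): 0}

# Model-based agent: uses a (here, conveniently perfect) learned transition
# model to imagine the outcome of each action before choosing.
learned_model = transition
scores = {a: reward[learned_model(4, a)] for a in (-1, +1)}
print(scores)                        # {-1: 0, 1: 10} -> prefers moving right
```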
---
Finally, I should point at some relevant thinking around how consequentialists probably dominate the universal prior. (Meaning: if you do an AIXI-like random search over programs, you get back mostly consequentialists.) See this post from Paul, and a small discussion on agentfoundations.
Agreed. I guess instrumental convergence mostly applies to AIs that we have to worry about, not all possible minds.
Understanding the consequences of actions is a reasonable requirement for a machine to be called intelligent; however, that implies nothing about the machine’s behavior. A paperclip maker may understand that destroying the earth could yield paperclips, it may not care much for humans, and it may still not do it. There is nothing inconsistent or unlikely about such a machine. You’re thinking of machines that have a pipeline: pick a goal → find the optimal plan to reach the goal → implement the plan. However, this is not the only possible architecture (though it is an appealing one).
More generally, intelligence is the ability to solve hard problems. If a program solves a problem without explicitly making predictions about the future, that’s fine. Though I don’t want to make claims about how common such programs would be. And there is also a problem of recognizing whether a program not written by a human does explicitly consider cause and effect.
Hm, I think an important piece of the “intuitionistic proof” didn’t transfer, or is broken. Drawing attention to that part:
Regardless of the details of how “decisions” are made, it seems easy for the choice to land on one of the massive array of outcomes that become available once you have control of the light-cone, i.e. once you have acquired power.
So here, I realize, I am relying on something like “the AI implicitly moves toward an imagined realizable future”. I think that’s a lot easier to get than the pipeline you sketch.
I think I’m being pretty unclear—I’m having trouble conveying my thought structure here. I’ll go make a meta-level comment instead.
As I understand it, your argument is that there are many dangerous world-states and few safe world-states, so most powerful agents would move to a dangerous state, in the spirit of entropy. This seems reasonable.
An alarming version of this argument assumes that the agents already have power; however, I think that they don’t, and that acquiring dangerous amounts of power is hard work that won’t happen by accident.
A milder version of the same argument says that even relatively powerless, unaligned agents would slowly and unknowingly inch towards a more dangerous future world-state. This is probably true; however, as long as humans retain some control, it is probably harmless. And it is also debatable to what extent that sort of probabilistic argument applies to a complex machine.
Though I don’t want to make claims about how common such programs would be.
If you don’t want to make claims about how common such programs are, how do you defend the (implicit) assertion that such programs are worth talking about, especially in the context of the alignment problem?
I don’t want to make claims about how many random programs make explicit predictions about the future to reach their goals. For all I know it could be 1% or it could be 99%. However, I do make claims about how common other kinds of programs are. I claim that a given random program, regardless of whether it explicitly predicts the future, is unlikely to have the kind of motivational structure that would exhibit instrumental convergence.
Incidentally, I’m also interested in what specifically you mean by “random program”. A natural interpretation is that you’re talking about a program that is drawn from some kind of distribution across the set of all possible programs, but as far as I can tell, you haven’t actually defined said distribution. Without a specific distribution to talk about, any claim about how likely a “random program” is to do anything is largely meaningless, since for any such claim, you can construct a distribution that makes that claim true.
(Note: The above paragraph was originally a parenthetical note on my other reply, but I decided to expand it into its own, separate comment, since in my experience having multiple unrelated discussions in a single comment chain often leads to unproductive conversation.)
Well, good question. Frankly, I don’t think it matters. I don’t believe that my claims are sensitive to the choice of distribution (aside from some convoluted ones), or that giving you a specific distribution would help you defend either position (feel free to prove me wrong). But when I want to feel rigorous, I assume that I’m starting off with a natural length-based distribution over all Turing machines (or maybe all neural networks), then discarding every machine that fails some relatively simple criterion about the output it generates (e.g. does it classify a given set of cat pictures correctly?), keeping the ones that pass, normalizing, and drawing from that.
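(For concreteness, a rough sketch of that procedure, with bitstrings standing in for programs and a placeholder criterion; it is only meant to pin down the sampling process, not to be a serious model.)

```python
# Rough sketch of the distribution described above: a length-based prior over
# bitstring "programs" (each extra bit halves the probability), filtered by a
# toy stand-in for the "simple criterion", then renormalised by drawing
# uniformly from the survivors. The criterion here is obviously a placeholder.
import random

def sample_by_length(max_len=16):
    """Each additional symbol is added with probability 1/2, so longer
    programs are exponentially less likely."""
    prog = []
    while len(prog) < max_len and random.random() < 0.5:
        prog.append(random.choice("01"))
    return "".join(prog)

def passes_simple_criterion(prog):
    """Placeholder for 'classifies the given cat pictures correctly'."""
    return prog.count("1") >= 2

random.seed(1)
candidates = [sample_by_length() for _ in range(100_000)]
survivors = [p for p in candidates if passes_simple_criterion(p)]
chosen = random.choice(survivors)   # a draw from the filtered, renormalised prior
print(len(survivors), repr(chosen))
```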
But really, by “random” I mean nearly anything that’s not entirely intentional. To use a metaphor for machine learning: if you pick a random point on the world map, then find the nearest point that’s 2km above sea level, you’ll end up at a “random” point that’s 2km above sea level. The algorithm has a non-random step, but the outcome is clearly random in a significant way. The distribution you get is different from the one I described in the previous paragraph (where you just filtered the initial point distribution to keep the points at 2km), but the two will most likely be close.
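(A 1-D toy version of that metaphor, again my own construction: “filter then sample” versus “pick a random start and move to the nearest qualifying point”. With reasonably spread-out qualifying points the two distributions come out close, though not identical.)

```python
# 1-D toy version of the map metaphor: qualifying points stand in for
# "points at 2km above sea level" on a line from 0 to 100.
import random

qualifying = [10, 30, 55, 70, 90]

def nearest_qualifying(x):
    """Move from a random starting point to the closest qualifying point."""
    return min(qualifying, key=lambda q: abs(q - x))

random.seed(0)
n = 100_000
filtered = [random.choice(qualifying) for _ in range(n)]                    # filter, then sample
projected = [nearest_qualifying(random.uniform(0, 100)) for _ in range(n)]  # sample, then move

for q in qualifying:
    print(q, filtered.count(q) / n, projected.count(q) / n)
# The two columns of frequencies are similar but not identical: under
# "sample then move", each point's probability is proportional to the size
# of the region nearest to it, rather than uniform over qualifying points.
```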
Maybe that answers your other comment too?
I claim that a given random program, regardless of whether it explicitly predicts the future, is unlikely to have the kind of motivational structure that would exhibit instrumental convergence.
Yes, I understand that. What I’m more interested in knowing, however, is how this statement connects to AI alignment in your view, since any AI created in the real world will certainly not be “random”.