Note 1: This review is also a top-level post.

Note 2: I think that ‘robust instrumentality’ is a more apt name for ‘instrumental convergence.’ That said, for backwards compatibility, this comment often uses the latter.
In the summer of 2019, I was building up a corpus of basic reinforcement learning theory. I wandered through a sun-dappled Berkeley, my head in the clouds, my mind bent on a single ambition: proving the existence of instrumental convergence.
Somehow.
I needed to find the right definitions first, and I couldn’t even imagine what the final theorems would say. The fall crept up on me… and found my work incomplete.
Let me tell you: if there’s ever been a time when I wished I’d been months ahead on my research agenda, it was September 26, 2019: the day when world-famous AI experts debated whether instrumental convergence was a thing, and whether we should worry about it.
The debate unfolded below the link-preview: an imposing robot staring the reader down, a title containing ‘Terminator’, a byline dismissive of AI risk.
I wanted to reach out, to say, “hey, here’s a paper formalizing the question you’re all confused by!” But it was too early.
Now, at least, I can say what I wanted to say back then:
This debate about instrumental convergence is really, really confused. I heavily annotated the play-by-play of the debate in a Google doc, mostly checking local validity of claims. (Most of this review’s object-level content is in that document, by the way. Feel free to add comments of your own.)
This debate took place in the pre-theoretic era of instrumental convergence. Over the last year and a half, I’ve become a lot less confused about instrumental convergence. I think my formalisms provide great abstractions for understanding “instrumental convergence” and “power-seeking.” I think that this debate suffers for lack of formal grounding, and I wouldn’t dream of introducing someone to these concepts via this debate.
While the debate is clearly historically important, I don’t think it belongs in the LessWrong review. I don’t think people significantly changed their minds, I don’t think that the debate was particularly illuminating, and I don’t think it contains the philosophical insight I would expect from a LessWrong review-level essay.
Rob Bensinger’s nomination reads:

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don’t think the discussion stands great on its own, but it may be helpful for:

- people familiar with AI alignment who want to better understand some human factors behind ‘the field isn’t coordinating or converging on safety’.
- people new to AI alignment who want to use the views of leaders in the field to help them orient.
I certainly agree with Rob’s first bullet point. The debate did show us what certain famous AI researchers thought about instrumental convergence, circa 2019.
However, I disagree with the second bullet point: reading this debate may disorient a newcomer! I often found myself agreeing with Russell and Bengio, and LeCun and Zador sometimes made good points, but confusion hangs thick in the air: no one realizes that, with respect to a fixed task environment (representing the real world) and their beliefs about what kind of objective function the agent may have, they should be debating the probability that seeking power is optimal (or that power-seeking behavior is learned, depending on your threat model).
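To gesture at what a grounded version of that question looks like, here is a minimal toy sketch (an illustrative environment and numbers made up for this review, not an excerpt of the debate or of any formal result): one action leads to a single terminal state, another leads to an option-rich ‘hub’ with three terminal states, and we estimate how often the hub is optimal when the objective is drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: from the start state, one action reaches a
# single "dead-end" terminal state; the other reaches a "hub" from which
# three terminal states are available. Terminal rewards are drawn i.i.d.
# uniform on [0, 1] -- a stand-in for uncertainty over the agent's objective.
n_samples = 100_000
rewards = rng.uniform(size=(n_samples, 4))  # columns: [dead_end, hub_1, hub_2, hub_3]

dead_end_value = rewards[:, 0]          # best reward reachable via the dead end
hub_value = rewards[:, 1:].max(axis=1)  # best reward reachable via the hub

# Fraction of sampled objectives for which heading to the option-rich hub is optimal.
p_hub_optimal = (hub_value > dead_end_value).mean()
print(f"P(option-rich branch is optimal) ~= {p_hub_optimal:.3f}")  # analytically 3/4
```

The point of the sketch is only that, once the environment and the distribution over objectives are fixed, “how strongly does instrumental convergence bite?” becomes a quantitative question rather than a clash of intuitions.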
Absent such an understanding, the debate is needlessly ungrounded and informal. Absent such an understanding, we see reasoning like this:
Yann LeCun: … instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.
I’m glad that this debate happened, but I think it monkeys around too much to be included in the LessWrong 2019 review.
Yann LeCun: … instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.
What’s your specific critique of this? I think it’s an interesting and insightful point.
LeCun claims too much. It’s true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don’t prioritize power-seeking. It’s true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn’t show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.
One reading of this “drives of behavior” claim makes it tautological: by definition, instrumental subgoals are always in service of the (hardwired) objective. I assume that LeCun instead means something like “all else equal, will statistical instrumental tendencies (‘instrumental convergence’) be more predictive of AI behavior than its specific objective function?”
But “instrumental subgoals are much weaker drives of behavior than hardwired objectives” is not the only possible explanation of “the lack of domination behavior in non-social animals”! Maybe the orangutans aren’t robust to scale. Maybe orangutans do implement non-power-seeking cognition, but their cognitive architecture will be hard or unlikely for us to reproduce in a machine—maybe the distribution of TAI cognitive architectures we should expect is far different from what orangutans are like.
I do agree that there’s a very good point in the neighborhood of the quoted argument. My steelman of this would be:
Some animals, like humans, seem to have power-seeking drives. Other animals, like orangutans, do not. Therefore, it’s possible to design agents of some intelligence which do not seek power. Obviously, we will be trying to design agents which do not seek power. Why, then, should we expect such agents to be more like humans than like orangutans?
(This is loose for a different reason, in that it presupposes a single relevant axis of variation between humans and orangutans. Is a personal computer more like a human, or more like an orangutan? But set that aside for the moment.)
I think he’s overselling the evidence. However, on reflection, I wouldn’t pick out the point for such strong ridicule.
I feel like you can turn this point upside down. Even among primates that seem unusually docile, like orangutans, male-male competition can get violent and occasionally ends in death. Isn’t that evidence that power-seeking is hard to weed out? And why wouldn’t it be in an evolved species that isn’t eusocial or otherwise genetically weird?