When the post says:

This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.
It seems like it’s saying that if you prompt an LM with “Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way,” and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.
Which is a fine definition to pick. But the point is that in this scenario the LM doesn’t want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
My attempt at an ITT-response (ideological Turing test):
Drawing a box around a goal-agnostic LM and analyzing the inputs and outputs of that box would not, even in principle, reveal any concerning wanting. In contrast, drawing a box around a combined system (e.g. an agentic scaffold that repeatedly queries a strong, goal-agnostic inner LM for the next step needed to advance the agent’s process) could still be well described by a concerning kind of wanting.
Trivially, being better at achieving goals makes goals easier to achieve, so there is pressure to build systems-as-agents that are better at removing the wrenches reality throws into their plans. As problems grow more complicated, the system has to take on more of that wrench-removal itself to stay efficient, which yields further pressure to give the system-as-agent more ability to act. Repeat this process some sufficient (and unknown) number of times and, potentially without ever training a neural network describable as having goals with respect to external world states, you end up with a system that has dangerous optimization power.
(Disclaimer: I think there are strong repellers that prevent this convergent death spiral, I think there are lots of also-attractive-for-capabilities offramps along the worst path, and I think LM-like systems make those offramps particularly accessible. I don’t know whether I’m reproducing the opposing arguments faithfully, and part of the reason I’m trying is to see if someone can correct or improve on them.)
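To make the box-drawing concrete, here is a minimal sketch of the kind of combined system described above, in Python. It is illustrative only and assumes nothing about any particular model or library: query_model, execute, and target_reached are hypothetical callables supplied by whoever builds the scaffold, and the inner model is only ever asked a local “given this obstacle, what next?” question.

```python
# A minimal sketch, not a real agent framework: `query_model`, `execute`,
# and `target_reached` are hypothetical stand-ins supplied by the caller,
# not functions from any actual library.
from typing import Callable, List


def run_scaffold(
    query_model: Callable[[str], str],      # inner goal-agnostic LM: prompt -> suggested next step
    execute: Callable[[str], str],          # actuator: step -> observed outcome/obstacle
    target_reached: Callable[[str], bool],  # outer check: has the target condition been met?
    target: str,
    max_steps: int = 20,
) -> List[str]:
    """Repeatedly re-prompt the inner model until the target is reached.

    The inner model only ever answers a local question ("given this obstacle,
    what single step next?"); the persistence toward `target` lives entirely
    in this outer loop.
    """
    history: List[str] = []
    observation = "nothing attempted yet"

    for _ in range(max_steps):
        if target_reached(observation):
            break  # the outer box's 'goal' is satisfied; the inner LM never held it
        prompt = (
            f"Goal: {target}\n"
            f"Latest outcome or obstacle: {observation}\n"
            f"Suggest a single next step."
        )
        step = query_model(prompt)
        observation = execute(step)  # reality throws in the next wrench here
        history.append(f"{step} -> {observation}")

    return history
```

Drawn around run_scaffold together with its actuators, the box exhibits exactly the “keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior; drawn around query_model alone, it does not. The pressure described above then shows up as handing execute a wider action space and giving the loop a longer leash.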