Excellent first point. I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.
I do need to answer that question using a goal-oriented search process. But my goal would be “answer Paul’s question”, not “destroy the world”. Maybe a different type of system could do it with no goal whatsoever, but that’s not clear.
But I’m puzzled by your statement:
a system may not even be able to “want” things in the behaviorist sense
Perhaps you mean LLMs/predictive foundation models?
I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.
You’re changing the topic to “can you do X without wanting Y?”, when the original question was “can you do X without wanting anything at all?”.
Nate’s answer to nearly all questions of the form “can you do X without wanting Y?” is “yes”, hence his second claim in the OP: “the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular”.
I do need to answer that question using a goal-oriented search process. But my goal would be “answer Paul’s question”, not “destroy the world”.
Your ultimate goal would be neither of those things; you’re a human, and if you’re answering Paul’s question it’s probably because you have other goals that are served by answering.
In the same way, an AI that’s sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it’s unlikely by default that “answer questions” will be the AI’s primary goal.
When the post says:
This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.
It seems like it’s saying that if you prompt an LM with “Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way,” and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.
Which is a fine definition to pick. But the point is that in this scenario the LM doesn’t want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
But the point is that in this scenario the LM doesn’t want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
My attempt at an ITT-response:
Drawing a box around a goal-agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal-agnostic LM to advance the agent’s process—could still be well-described by a concerning kind of wanting.
Trivially, being better at achieving goals makes achieving goals easier, so there’s pressure to build systems-as-agents which are better at removing wrenches. As the problems become more complicated, the system needs to be more responsible for removing wrenches to be efficient, yielding further pressure to give the system-as-agent more ability to act. Repeat this process a sufficient and unknown number of times and, potentially without ever training a neural network describable as having goals with respect to external world states, there’s a system with dangerous optimization power.
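A minimal sketch of the scaffold pattern described above (all names here are hypothetical illustrations, not a real API): the inner model is stateless and goal-agnostic, mapping one prompt to one completion, while the outer loop owns the target and keeps re-prompting until it is met. The persistence, the “wanting-like” behavior visible from outside the box, lives in the scaffold rather than in the network.

```python
def make_toy_model():
    """Stand-in for a goal-agnostic LM: answers one prompt, keeps no state."""
    def query(prompt: str) -> str:
        # A real model would condition on the prompt; the toy just
        # proposes the next incremental step every time it is asked.
        return "take one more step"
    return query

def scaffold(query, target_steps: int, max_iters: int = 100) -> list[str]:
    """Outer agent loop: repeatedly asks the inner model to advance the
    process until the target is reached. Viewed as one box, the combined
    system keeps reorienting toward its target however many calls it takes."""
    history: list[str] = []
    for _ in range(max_iters):
        if len(history) >= target_steps:
            break  # target met; the loop (not the model) decided to stop
        history.append(query(f"Progress so far: {history}. How do I advance?"))
    return history
```

Drawing the box around `query` alone shows no pursuit of anything; drawing it around `scaffold(query, ...)` shows target-seeking behavior, which is the distinction the paragraph above is pointing at.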
(Disclaimer: I think there are strong repellers that prevent this convergent death spiral, I think there are lots of also-attractive-for-capabilities offramps along the worst path, and I think LM-like systems make these offramps particularly accessible. I don’t know if I’m reproducing opposing arguments faithfully and part of the reason I’m trying is to see if someone can correct/improve on them.)
Thinking about it a little more, there may be a good reason to consider how humans pursue mid-horizon goals.
I think I do make a goal of answering Paul’s question. It’s not a subgoal of my primary values of getting food, status, etc., because backward-chaining is too complex. It’s based on a vague estimate of the value (total future reward) of that action in context. I wrote about this in Human preferences as RL critic values—implications for alignment, but I’m not sure how clear that brief post was.
I was addressing a different part of Paul’s comment than the original question. I mentioned that I didn’t have an answer to the question of whether one can make long-range plans without wanting anything. I did try an answer in a separate top-level response:
it doesn’t matter much whether a system can pursue long-horizon tasks without wanting, because agency is useful for long-horizon tasks, and it’s not terribly complicated to implement. So AGI will likely have it built in, whether or not it would emerge from adequate non-agentic training. I think people will rapidly agentize any oracle system. It’s useful to have a system that does things for you. And to do anything more complicated than answer one email, the user will be giving it a goal that may include instrumental subgoals.
The possibility of emergent wanting might still be important in an agent scaffolded around a foundation model.
Perhaps I’m confused about the scenarios you’re considering here. I’m less worried about LLMs achieving AGI and developing emergent agency, because we’ll probably give them agency before that happens.
You’re changing the topic to “can you do X without wanting Y?”, when the original question was “can you do X without wanting anything at all?”.
A system that can, under normal circumstances, explain how to solve a problem won’t necessarily act to remove a problem that gets in the way of its explaining the solution. The notion of wanting that Nate proposes is “solving problems in order to achieve the objective”, and this need not apply to the system that explains solutions. In short: yes.
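The contrast can be made concrete with a toy sketch (hypothetical code, not anyone’s proposed implementation): an explainer that only emits a plan and never re-engages, versus a pursuer whose loop keeps acting until the goal holds, even when a simulated obstacle undoes its work.

```python
def explainer(goal: str) -> str:
    """Explains how to reach the goal; never touches the world. If an
    obstacle later blocks the plan, nothing here re-engages."""
    return f"to reach {goal!r}, push the state until it equals {goal!r}"

def pursuer(goal: str, world: dict, max_iters: int = 20) -> dict:
    """Acts, observes, and re-acts until the goal holds: wanting 'in the
    behaviorist sense'. A toy obstacle reverts the state a few times;
    only persistent re-engagement gets past it."""
    wrenches = 3  # reality throws a few wrenches
    while world["state"] != goal and max_iters > 0:
        world["state"] = goal        # act toward the goal
        if wrenches > 0:             # an obstacle undoes the work
            world["state"] = "setback"
            wrenches -= 1
        max_iters -= 1
    return world
```

Only `pursuer` exhibits the “keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior; `explainer` can describe the same plan without it.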