Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected ‘agents’ going to update?”, so, here we are.
Okay, so you know how AI today isn’t great at certain… let’s say “long-horizon” tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)
And you know how the AI doesn’t seem to have all that much “want”- or “desire”-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI’s ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you’re check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)
Well, I claim that these are more-or-less the same fact. It’s no surprise that the AI falls down on various long-horizon tasks and that it doesn’t seem all that well-modeled as having “wants/desires”; these are two sides of the same coin.
Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the “behaviorist sense” expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise. Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one’s plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.
This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.
I make no claim about the AI’s internal states and whether those bear any resemblance to the internal state of a human consumed by the feeling of desire. To paraphrase something Eliezer Yudkowsky said somewhere: we wouldn’t say that a blender “wants” to blend apples. But if the blender somehow managed to spit out oranges, crawl to the pantry, load itself full of apples, and plug itself into an outlet, then we might indeed want to start talking about it as though it has goals, even if we aren’t trying to make a strong claim about the internal mechanisms causing this behavior.
If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I’ll say it “wants” that outcome “in the behaviorist sense”.
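As a toy illustration of this behaviorist criterion, here's a minimal sketch (all names and parameters are invented for illustration, not from the post): a gridworld agent that replans a shortest path every step while random obstacles ("wrenches") appear mid-episode. Viewed from the outside, it reliably steers into the target state across randomized starts and obstacle patterns, even though internally it's nothing but breadth-first search.

```python
import random
from collections import deque

def bfs_next_step(grid, start, goal):
    """First move along a shortest open path from start to goal, or None if walled off."""
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, None)])
    seen = {start}
    while frontier:
        (r, c), first = frontier.popleft()
        if (r, c) == goal:
            return first
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in seen):
                seen.add(nxt)
                frontier.append((nxt, first or nxt))  # remember the first step taken
    return None

def run_agent(seed, size=8, wrench_prob=0.25, max_steps=200):
    """One episode: random start, random mid-episode obstacles, replan every step."""
    rng = random.Random(seed)
    goal = (size - 1, size - 1)
    pos = (rng.randrange(size - 1), rng.randrange(size - 1))  # random start, never the goal
    grid = [[0] * size for _ in range(size)]
    for _ in range(max_steps):
        if pos == goal:
            return True
        if rng.random() < wrench_prob:  # a wrench: a new obstacle appears somewhere
            cell = (rng.randrange(size), rng.randrange(size))
            if cell not in (pos, goal):
                grid[cell[0]][cell[1]] = 1
        move = bfs_next_step(grid, pos, goal)
        if move is None:  # fully walled off: remove a wrench and try again
            blocked = [(r, c) for r in range(size) for c in range(size) if grid[r][c]]
            r, c = rng.choice(blocked)
            grid[r][c] = 0
        else:
            pos = move
    return pos == goal

print(sum(run_agent(seed) for seed in range(100)), "/ 100 randomized episodes reached the target")
```

The point of the sketch: we'd happily say this agent "wants" to reach the corner in the behaviorist sense, purely because of the outcome statistics across varied setups, with no claim about its (trivial) internals.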
Why might we see this sort of “wanting” arise in tandem with the ability to solve long-horizon problems and perform long-horizon tasks?
Because these “long-horizon” tasks involve maneuvering the complicated real world into particular tricky outcome-states, despite whatever surprises and unknown-unknowns and obstacles it encounters along the way. Succeeding at such problems just seems pretty likely to involve skill at figuring out what the world is, figuring out how to navigate it, and figuring out how to surmount obstacles and then reorient in some stable direction.
(If each new obstacle causes you to wander off towards some different target, then you won’t reliably be able to hit targets that you start out aimed towards.)
If you’re the sort of thing that skillfully generates and enacts long-term plans, and you’re the sort of planner that sticks to its guns and finds a way to succeed in the face of the many obstacles the real world throws your way (rather than giving up or wandering off to chase some new shiny thing every time a new shiny thing comes along), then the way I think about these things, it’s a little hard to imagine that you don’t contain some reasonably strong optimization that strategically steers the world into particular states.
(Indeed, this connection feels almost tautological to me, such that it feels odd to talk about these as distinct properties of an AI. “Does it act as though it wants things?” isn’t an all-or-nothing question, and an AI can be partly goal-oriented without being maximally goal-oriented. But the more the AI’s performance rests on its ability to make long-term plans and revise those plans in the face of unexpected obstacles/opportunities, the more consistently it will tend to steer the things it’s interacting with into specific states—at least, insofar as it works at all.)
The ability to keep reorienting towards some target seems like a pretty big piece of the puzzle of navigating a large and complex world to achieve difficult outcomes.
And this intuition is backed up by the case of humans: it’s no mistake that humans wound up having wants and desires and goals—goals that they keep finding clever new ways to pursue even as reality throws various curveballs at them, like “that prey animal has been hunted to extinction”.
These wants and desires and goals weren’t some act of a god bequeathing souls into us; this wasn’t some weird happenstance; having targets like “eat a good meal” or “impress your friends” that you reorient towards despite obstacles is a pretty fundamental piece of being able to eat a good meal or impress your friends. So it’s no surprise that evolution stumbled upon that method, in our case.
(The implementation specifics in the human brain—e.g., the details of our emotional makeup—seem to me like they’re probably fiddly details that won’t recur in an AI that has behaviorist “desires”. But the overall “to hit a target, keep targeting it even as you encounter obstacles” thing seems pretty central.)
The above text vaguely argues that doing well on tough long-horizon problems requires pursuing an abstract target in the face of a wide array of real-world obstacles, which involves doing something that looks from the outside like "wanting stuff". I'll now make a second claim (supported here by even less argument): that the wanting-like behavior required to pursue a particular training target X does not need to involve the AI wanting X in particular.
For instance, humans find themselves wanting things like good meals and warm nights and friends who admire them. And all those wants added up in the ancestral environment to high inclusive genetic fitness. Observing early hominids from the outside, aliens might have said that the humans are “acting as though they want to maximize their inclusive genetic fitness”; when humans then turn around and invent birth control, it’s revealed that they were never actually steering the environment toward that goal in particular, and instead had a messier suite of goals that correlated with inclusive genetic fitness, in the environment of evolutionary adaptedness, at that ancestral level of capability.
Which is to say, my theory says "AIs need to be robustly pursuing some targets to perform well on long-horizon tasks", but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist goal is very unlikely to be the exact goal the programmers intended; it's more likely to be (e.g.) a tangled web of correlates.
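To make the proxy-target point concrete, here's a minimal hypothetical sketch (invented setup, not from the post): a policy that converged on "approach the nearest object" in a training distribution where the nearest object happened to always be red. It scores perfectly on the intended goal "reach the red object" during training, and its actual behaviorist target is only revealed under distribution shift, much as birth control revealed that humans weren't steering toward inclusive fitness.

```python
import random

def policy_target(agent, objects):
    """The learned policy: approach the nearest object (Manhattan distance).
    In training, this coincided perfectly with "approach the red object"."""
    pos = min(objects, key=lambda p: abs(p[0] - agent[0]) + abs(p[1] - agent[1]))
    return objects[pos]

def episode(rng, shifted=False):
    """Training distribution: red is always the near object. Shift: blue is near."""
    near = (rng.randrange(1, 4), rng.randrange(1, 4))    # Manhattan distance <= 6
    far = (rng.randrange(6, 10), rng.randrange(6, 10))   # Manhattan distance >= 12
    red, blue = (far, near) if shifted else (near, far)
    return policy_target((0, 0), {red: "red", blue: "blue"})

rng = random.Random(0)
train = sum(episode(rng) == "red" for _ in range(1000))
deploy = sum(episode(rng, shifted=True) == "red" for _ in range(1000))
print(train, deploy)  # 1000 0: perfect on the training correlate, zero on the intended goal
```

From training behavior alone, "wants red" and "wants whatever is nearest" are observationally identical; only the shifted distribution distinguishes them.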
A follow-on inference from the above point is: when the AI leaves training, and it’s tasked with solving bigger and harder long-horizon problems in cases where it has to grow smarter than ever before and develop new tools to solve new problems, and you realize finally that it’s pursuing neither the targets you trained it to pursue nor the targets you asked it to pursue—well, by that point, you’ve built a generalized obstacle-surmounting engine. You’ve built a thing that excels at noticing when a wrench has been thrown in its plans, and at understanding the wrench, and at removing the wrench or finding some other way to proceed with its plans.
And when you protest and try to shut it down—well, that’s just another obstacle, and you’re just another wrench.
So, maybe don’t build those generalized wrench-removers just yet, at least not until we know how to load proper targets into them.
The main insight of the post (as I understand it) is this:
- In the context of a discussion of whether we should be worried about AGI x-risk, someone might say “LLMs don’t seem like they’re trying hard to autonomously accomplish long-horizon goals—hooray, why were people so worried about AGI risk?”
- In the context of a discussion among tech people and VCs about how we haven’t yet made an AGI that can found and run companies as well as Jeff Bezos, someone might say “LLMs don’t seem like they’re trying hard to autonomously accomplish long-horizon goals—alas, let’s try to fix that problem.”
One sounds good and the other sounds bad, but there’s a duality connecting them. They’re the same observation. You can’t get one without the other.
This is an important insight because it helps us recognize that people are actively trying to solve the second-bullet-point problem (and making nonzero progress), and that, to the extent they succeed, they’ll make things worse from the perspective of the people in the first bullet point.
This insight is not remotely novel! (And OP doesn’t claim otherwise.) …But that’s fine, nothing wrong with saying things that many readers will find obvious.
(This “duality” thing is a useful formula! Another related example that I often bring up is the duality between positive-coded “the AI is able to come up with out-of-the-box solutions to problems” versus the negative-coded “the AI sometimes engages in reward hacking”. I think another duality connects positive-coded “it avoids catastrophic forgetting” to negative-coded “it’s hard to train away scheming”, at least in certain scenarios.)
(…and as comedian Mitch Hedberg sagely noted, there’s a duality between positive-coded “cheese shredder” and negative-coded “sponge ruiner”.)
The post also touches on two other (equally “obvious”) topics:
- Instrumental convergence: “the AI seems like it’s trying hard to autonomously accomplish long-horizon goals” involves the AI routing around obstacles, and one might expect that to generalize to “obstacles” like programmers trying to shut it down.
- Goal (mis)generalization: If “the AI seems like it’s trying hard to autonomously accomplish long-horizon goal X”, then the AI might actually “want” some different Y which partly overlaps with X, or is downstream from X, etc.
But the question on everyone’s mind is: Are we doomed?
In and of itself, nothing in this post proves that we’re doomed. I don’t think OP ever explicitly claimed it did? In my opinion, there’s nothing in this post that should constitute an update for the many readers who are already familiar with instrumental convergence, and goal misgeneralization, and the fact that people are trying to build autonomous agents. But OP at least gives a vibe of being an argument for doom going beyond those things, which I think was confusing people in the comments.
Why aren’t we necessarily doomed? Now this is my opinion, not OP’s, but here are three pretty-well-known outs (at least in principle):
- The AI can “want” to autonomously accomplish a long-horizon goal, but also simultaneously “want” to act with integrity, helpfulness, etc. Just like it’s possible for humans to do. And if the latter “want” is strong enough, it can outvote the former “want” in cases where they conflict. See my post Consequentialism & corrigibility.
- The AI can behaviorist-“want” to autonomously accomplish a long-horizon goal, but where the “want” is internally built in such a way as to not generalize OOD to make treacherous turns seem good to the AI. See e.g. my post Thoughts on “Process-Based Supervision”, which is skeptical about the practicalities, but I think the idea is sound in principle.
- We can in principle simply avoid building AIs that autonomously accomplish long-horizon goals, notwithstanding the economic and other pressures—for example, by keeping humans in the loop (e.g. oracle AIs). This one came up multiple times in the comments section.
There are plenty of challenges in these approaches, and interesting discussions to be had, but the post doesn’t engage with any of these topics.
Anyway, I’m voting strongly against including this post in the 2023 review. It’s not crisp about what it’s arguing for and against (and many commenters seem to have gotten the wrong idea about what it’s arguing for), it’s saying obvious things in a meandering way, and it’s not refuting or even mentioning any of the real counterarguments / reasons for hope. It’s not “best of” material.