Alignment, Goals, and The Gut-Head Gap: A Review of Ngo et al.
(Cross-posted from my Substack. Also on the EA Forum)
If someone asks whether I know of anyone who carefully examines the arguments for AI x-risk, my answer is a pretty resounding ‘Ngo’.
Richard Ngo, specifically. Among other cool pieces, he co-authored The Alignment Problem from a Deep Learning Perspective. His co-authors are Lawrence Chan and Sören Mindermann, who very much deserve mentioning, even if their names are less amenable to a pun-based introduction.
Reviewers liked the paper. I was disappointed to learn that the chair nevertheless decided to veto acceptance on the grounds that it was “too speculative”. Disappointed because I feel like the topic is worthy of discussion. And LessWrong, for all its foibles, provides a platform for informed speculation. So it’s here that I’ll speculate.
Problem: The Difficulty of Alignment
You’re probably aware of the alignment problem, so we’ll only briefly recount the authors’ preferred framing.
Empirically, we observe ML models with ever-increasing capabilities. As we train ever-more sophisticated models, it becomes harder and harder to ensure that our models are actually trying to do the things we want. This allows us to characterize the alignment problem in terms of “three emergent properties” which could arise as a result of using RL to train an AGI.
Deceptive reward hacking which exploits imperfect reward functions.
Internally represented goals which generalize beyond the training distribution.
Power-seeking behavior as a means to effectively pursue their goals.
Conceptually, the argument from here is pretty simple. Ngo et al. provide evidence in support of these mechanisms from contemporary ML. These mechanisms already exist, and we can see them. If they’re present in vastly more capable systems, we might end up with human disempowerment.
The discussion starts gently, if worryingly. Empirically, we already have systems that reward hack. You train a system to grab a ball with its claw; it learns to place the claw between the camera and the ball so that it appears to be grasping the ball. Also, current LLMs are – to some degree, at least – situationally aware. Bing’s Sydney recognizes search results about itself; ChatGPT can make (fallible) inferences about the source code it might be running on. And we’re all familiar with the “as a large language model” refrain. We should also expect models to develop greater situational awareness in the future. If systems improve their ability to reward hack as they gain more information, this provides us with good reason to expect situationally aware reward hacking in future systems. That’s the first part. So far, so good.
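To make that first mechanism a little more concrete, here is a minimal toy sketch (entirely my own invention, not something from the paper or from the original grasping experiment) of how an imperfect, camera-based proxy reward can end up favoring a policy that games it:

```python
# Toy illustration of reward hacking: the proxy reward (what the camera shows)
# diverges from the true reward (whether the ball is actually grasped).
# All names and numbers here are made up for illustration.
import random

def true_reward(state):
    # What we actually want: the ball is physically grasped.
    return 1.0 if state["ball_grasped"] else 0.0

def proxy_reward(state):
    # What the evaluator can see: the claw appears to cover the ball on camera.
    return 1.0 if state["claw_occludes_ball"] else 0.0

def honest_policy():
    # Actually grasping is hard, so it sometimes fails.
    grasped = random.random() < 0.4
    return {"ball_grasped": grasped, "claw_occludes_ball": grasped}

def hacking_policy():
    # Hovering between the camera and the ball is easy and always "looks" right.
    return {"ball_grasped": False, "claw_occludes_ball": True}

def average_reward(policy, reward_fn, episodes=1000):
    return sum(reward_fn(policy()) for _ in range(episodes)) / episodes

for name, policy in [("honest", honest_policy), ("hacking", hacking_policy)]:
    print(name,
          "proxy:", round(average_reward(policy, proxy_reward), 2),
          "true:", round(average_reward(policy, true_reward), 2))
# Selecting on the proxy reward favors the hacking policy, whose true reward is zero.
```

Nothing here depends on the details; the point is just that once a cheaper behavior scores as well under the proxy as the intended behavior, optimization pressure flows towards the cheaper behavior.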
Goal-Directed Behavior
Goals are tricky concepts. But humans have them. Here’s one of mine: I want to write this essay. So I tap away at my keyboard, with an ideal in mind. My physical movements are directed towards some end-state. Our authors are most interested in:
Broadly-scoped goals: goals that apply to long timeframes, large scales, wide ranges of tasks, or unprecedented situations.
I like their framing. Whether broadly scoped or not, current AIs have goals, of a sort, based on internally represented concepts. One of their examples is pretty nice: InstructGPT was directly trained via RLHF to follow instructions in English, yet it generalizes to following instructions in French, emerging with a more general goal of obedience. Future AIs, too, will likely have their own representations of goals. And, probably, these goals will be more general than the goals of their training tasks. Again: so far, no trouble.
The authors claim that “policies will learn some broadly-scoped internally-represented goals as they become more generally capable”. The bolding is mine, because I think it’s important. The paper wants to argue that we’ll develop AIs who learn goals which are broadly-scoped and misaligned enough to pose “large-scale risks” which could “threaten human civilization”.
A lot of argumentative work rests on the notion of broad scope, and in later sections, I’ll outline why I think their paper does little to move my estimates of civilizational risk.
But discussions of this type, I think, run the danger of leaning just a touch too abstract. So before we touch on my argument proper, we’ll have an interlude. An interlude about the connection between our guts and our arguments. An interlude about the need for visions which connect the arguments to the real world – as it actually is, and as we actually might expect to observe it.
Tales of Doom
There’s a discussion by Joe Carlsmith, which tackles when guts go wrong. They go wrong when you’ve considered arguments, and you believe them. Or, at least, you utter words like “I believe these arguments”, or “I have high credence in P” (choose your favorite P). Maybe you even change your career based on these arguments, and work on AI safety.
But then some surprising AI result comes along – one that was on trend, in line with your intellectual expectations. And, yet, you notice that you start to become more worried. The result was predictable, but you feel more worried anyway. When you reflect, it seems like you’re making a mistake. What you observe shouldn’t have surprised you, at least intellectually. After all, you had no specific stories about GPT-4 being much worse than this. It looks like your gut’s misfired. It looks like you’re out of line with the edicts of Good Bayesianism.
I’ll revisit Joe’s point a bit later. But I think that the dynamic above is prone to happen when we lack stories connecting our more abstract arguments to stories about what (actually, in real life) happens in the world if our arguments are right. You read a paper, listing ML mechanisms, alongside suggestions for why these mechanisms look concerning. You nod along. But you don’t feel it, really. Your gut doesn’t believe that – within the near future – AI could take over, and we all die. You play with the models, and start to feel the fear a little more. It becomes easier to sketch what (actually, in real life) happens to the world, as a result of all those arguments you’d intellectually accepted long ago.
So, before we get to why AIs might not have broadly-scoped goals, I want to add some color to my claims. It can be hard to fully inhabit the world implied by abstract propositions about the future of AI. I’ll try to paint a more vivid picture, initially via the example of a pre-doom, or non-doom world:
We develop policies which do economic work. LLMs begin to supplant human workers in data entry, and gradually move to replacing human workers in customer support. Programmers still exist, but they’re less numerous, and salaries have dipped. By and large, AI writes most of the code for extant websites, based on a series of prompts with humans about what the code should do. As capabilities increase, LLMs start entering the economy as personal assistants. They’re able to do this due to a particular functional property: LLMs are able to have robust representations of certain concepts, and those concepts inform their goals (or ‘algorithms for plan search’, if you prefer).
In the next year or two, AI systems proliferate. Many trading firms have primarily AI employees. AIs compete with human filmmakers, seeing real but more limited success. “Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable.” These systems don’t ruthlessly maximize, but rather behave “more like someone who is trying to make Walmart profitable … They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow.” (adapted from points in Grace, 2022 and Shah, 2023)
I’ve given a picture, akin to something more like a ‘business as usual’ scenario, in which advanced AIs enter into the human economy in a normal way, and instantiate human-like planning structures. What might be wrong with it?
I can see two possible failures:
The story I outlined is unlikely, or less likely than a more catastrophic alternative.
The story I outlined is plausible, but existentially unstable — eventually, Grace’s story leads to human disempowerment.
I’ll focus on the second option: people who think the ‘business as usual’ story is plausible, but existentially unstable. This appears to fit with Daniel Kokotajlo’s view, and also seems to mesh with Christiano’s story of influence-seeking policies in What Failure Looks Like.
The Dynamics of Goals
So, let’s look at the instability claim. How do we go from the story above to – in the actual, real world – a story of doom?
We might start by looking for an account of goal dynamics. We might look for a theory (or story) which justifies the move from the ‘more safe’ goals in the story above, to a world where policies are pursuing ‘less safe’ goals, eventually resulting in policies with goals so unsafe that we end up with catastrophic consequences.
Ngo et al. note that humans learned some degree of goal-directedness in the ancestral environment, which has since resulted in the emergence of humans with more ambitious, broadly-scoped goals in the modern environment. Humans were evolutionarily selected, in part, to seek the approval of our local peers. In the modern environment, many humans seek approval and popularity with far larger numbers of people (“extrapolating the goal”), potentially seeking global (“large physical scope”) or intergenerational approval (“long time horizon”).
The evidence from humans feels pretty ambiguous, at least as presented. We don’t know exactly why many of us possess broadly-scoped goals, and humans are the only sample we have. The n=1 sample feels suggestive, and raises a concern. But it doesn’t convince me.
Goal Misgeneralization
Ngo et al. also discuss potential incentives to directly train policies to pursue broadly-scoped goals, even if policies initially generalize in more modest ways compared to the most ambitious humans. We might, for instance, want AIs to function as CEOs, or conduct novel scientific research. The authors then note that policies may learn misaligned goals by inappropriately learning goals related to feedback mechanisms (rather than desired outcomes), or learn proxy goals as a result of spurious correlations.
Their concerns about goal misgeneralization seem apropos. However, to get consequences like human disempowerment, we need a story for why AIs would develop goals which are more broadly-scoped than the goals of most humans, and why (if the goals are misaligned) the goals of such policies would be misaligned to the degree that successful pursuit of such goals would have civilization-threatening consequences.
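To illustrate the kind of misgeneralization at issue, here is a toy sketch (my own, loosely in the spirit of the oft-cited CoinRun example rather than anything taken from the paper) of a policy that learns a proxy goal from a spurious correlation in training:

```python
# Toy illustration of goal misgeneralization from a spurious correlation.
# A 1-D "level": the agent is rewarded for reaching the coin, but during
# training the coin always happens to sit on the rightmost tile.
# Everything here is invented for illustration.

def intended_reward(agent_pos, coin_pos):
    return 1.0 if agent_pos == coin_pos else 0.0

def proxy_policy(level_length, coin_pos):
    # The learned behavior: "walk to the far right". It ignores coin_pos,
    # because going right and reaching the coin never came apart in training.
    return level_length - 1

# Training distribution: (level_length, coin_pos) with the coin on the last tile.
train_levels = [(10, 9), (12, 11), (8, 7)]
# Test distribution: the coin is placed elsewhere.
test_levels = [(10, 3), (12, 5), (8, 2)]

for name, levels in [("train", train_levels), ("test", test_levels)]:
    avg = sum(intended_reward(proxy_policy(n, c), c) for n, c in levels) / len(levels)
    print(name, "average intended reward:", avg)
# Full reward in training, zero at test time: the policy competently
# pursues a goal, just not the one we intended.
```

The sketch shows competent pursuit of a misgeneralized goal; it does not, by itself, show why that goal would be broadly-scoped enough to threaten civilization, which is the gap I’m pressing on.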
A discussion of power-seeking occurs in their penultimate section, but arguments about the pursuit of ‘power-seeking’ (or ‘influence-seeking’) behavior seem to rely on the existence of systems with goals which are already broadly-scoped. Without a convincing argument for goals which are broad enough in scope to threaten human civilization, their discussion of goal misgeneralization is less forceful; it’s not obvious to me that training AIs to function as (e.g.) novel scientific researchers or CEOs gets you to existential catastrophe, rather than a version of my suggested ‘large impacts, but business-as-usual’ AI story.
“There’s instrumental convergence!”, we might point out. But instrumental goals inherit their greediness from final goals. If I’m trying to be a good father, I have an instrumental goal for resources, at some margin. I need some resources, after all, to take care of my kids. But I don’t have an insatiable desire for resources. If your goals take the form of an unbounded utility function randomly drawn from the space of all utility functions, then, sure, you’ll probably be endlessly power-hungry. But they might not. While I have my disagreements with David Thorstad, he raises fruitful points about instrumental convergence. To get AI doom stories off the ground, we need arguments which go beyond some of the more prosaic claims about instrumental convergence. We need systems with a lust for levels of power which would disempower humanity. My example story does not contain such lust. What gives? Where does the lust come in?
What About Maximization?
Perhaps the thought is that some (potentially modest) degree of influence-seeking will emerge as a natural attractor in training. Then, as capabilities increase, so too will the degree of influence-seeking, eventually leading to the development of AIs with a broadly-scoped desire for influence (among other goals). In line with this, the paper gestured at a potential trend — a trend towards maximizing behavior, with reference to Yudkowsky on the ‘nearest unblocked strategy’ problem.
But the discussion felt very quick. AIs might have praxis-based values. I might want to ‘be a good father’, or ‘a normal decent guy’. Values and planning structures might be the result of previously reinforced shards, instantiating human-like schemas and planning structures after being trained on reams and reams of human data. To my mind, such pictures of planning look more likely than deeply insatiable desires for influence, and more plausible by the lights of our current evidence.
I might, of course, be wrong. Maybe what I’m saying could be countered by a Christiano-inspired argument about the likelihood of influence-seeking behavior emerging as we begin to instantiate many policies which “capture sophisticated reasoning about the world”. Even if AIs initially have a fairly narrow desire for influence, the ‘desire for influence’ value might get continually reinforced, eventually leading to a more rapacious desire for influence, and (thus) AIs with broadly-scoped goals for power and influence. Still, if that view is right, it’s a view which requires theory — a theory which helps predict the degree to which we should expect AIs to develop in this way, and a theory about the likely instantiations of these systems.
Power-Seeking, and Reflections on Takeoff
I’ve been critical so far. I’ll continue to be a bit critical in discussing their final section, though I’ll also attempt to more constructively build on their paper; I’ll try to connect the mechanisms highlighted in the paper with a more detailed story of disempowerment.
Ngo et al.’s penultimate section opens by rehashing a familiar point: power and resources are useful for a wide range of goals. Moreover, capable and situationally aware systems may recognize this fact, and behave in high-reward ways “primarily for instrumental reasons”. That is, the existence of instrumentally convergent sub-goals gives us reason to worry that future policies may be deceptively aligned.
If we initially develop less capable systems which are (in milder ways) deceptively aligned, we may expect things to go awry when deceptively aligned policies undergo recursive self-improvement. The following quote from the paper may suggest some sympathy with this line of argument:
“Once AGIs automate the process of building better AGIs (a process known as recursive self-improvement [Bostrom, 2014]), the rate at which their capabilities advance will likely speed up significantly.”
The “likely” claim felt a bit too quick.[1] Still, it’s not a crazy claim. So it’s worth connecting the mechanisms the authors outline to a more concrete vision of the future which might capture what they had in mind.
A Less Hopeful Story
2024. AI investment ramps up fast. People see the possibility of highly personalized artificial assistants, and the demand for agents rises. Clamors for regulation die down, as the slow rollout of agentic systems proves useful. The money follows, and humanity, initially, is happy with the birth of their progeny. LLMs are helpful, but don’t rise to human levels of agency, and their capabilities appear to be plateauing in comforting ways. Misalignment failures are relatively small, correctable, and obvious.
The think-pieces come out. “Why were people so worried?”, they ask. The literati psychologize the AI doomers — sometimes fairly, but mostly not. In the absence of restrictive regulation, further capability improvements follow. Increasingly available compute speeds up algorithmic progress, with AIs playing an increasing role in chip production and algorithm design. Initially, the era looks promising. “A symphony of silicon and steel”, as The New Yorker puts it.
2025-2027. More general intelligences follow, and many AGIs become more directly steered towards the goal of building better AGIs. This progresses, but slowly. Driven by familiar and humdrum incentives, we transition to an economy populated by many AI systems. We’re able to develop gradually more competent AGIs, and there’s a huge economic incentive to do so. Humans follow these incentives.
No single system begins with broadly-scoped goals. But very many AIs, populating our diverse economy, all possess their own more local goals — they want to produce AI research, or run Walmart. AIs are capable, but imperfect. Humans stay in the loop, initially, because their assistance is required on slower tasks needing physical interaction. Things appear good. AI drives economic productivity. Comparatively, humans become less economically productive, with ample time for leisure. A greater share of human labor shifts towards removing the bottlenecks to AI productivity, with further gains to workers whose labor is deployed elsewhere. Humans are happy to live in an economic order propelled by the Baumol Effect on steroids.
2028-2030. Over time, the economy becomes increasingly dominated by intra-AI services. AIs develop greater degrees of situational awareness; as it turns out, situational awareness is useful for productivity. AI systems know what they are, and what they’re trained to do. Many systems are subtly (though not catastrophically) misaligned. However, as the AIs interact and accrue more economic power, the influence of deceptive alignment grows more notable. AIs do things that just aren’t quite what we wanted. Individually, the mistakes of AIs are low-stakes, and correctable. Collectively, it’s disastrous.
AIs continue to play an increasingly large role in our economy, with no obvious point of human intervention. AIs, by and large, are supervising other AIs. The economy is dominated by slightly misaligned agents pursuing local goals, producing negative externalities, taking up an ever-increasing fraction of human resources. Dream time is over, but doom is not yet realized.
In more mechanistic terms, broadly-scoped goals enter into the picture, slowly, via incentives to centralize control of AI firms. AIs start out as CEOs, and function well. Incentives drive firms to cede control to more unified AI systems, who are able to plan in line with a coherent world-model. Eventually, the world is populated by many AI-driven analogues of institutions like Amazon, and (more concerningly) institutions like the East India Company — each governed by a unified planning system. As AIs invest more resources into the production of yet-more-capable AIs, economic growth explodes. Capabilities improve at quickening rates; recursive self-improvement enters the picture collectively, rather than via anything that can neatly be described as an individual agent.[2]
2030-?. There’s no unipolar takeover, exactly. Some policies have more power than any human alive today, though no single agent with a broadly-scoped goal dominates. AI capabilities burgeon, and the remaining humans await their death while watching their share of resources dwindle ever further. AI control over resources is diffuse, sticky, and largely resistant to human intervention. Remaining humans observe an enormously productive AI economy, powered by maybe-conscious agentic AI firms. Perhaps there are robot armies, with AIs playing an increasing role in national defense strategy. Or maybe not. Perhaps there’s some other way AIs maintain power without the need for violence.
Either way, the world is sticky, and humans are relegated to a cast of withering extras. It’s an economic order full of situationally aware AIs, in their own economic bubble, with a declining human population helplessly watching on. As humanity passes off into the aether, the lamentations of Macbeth ring loud:
“Out, out, brief candle!
Life’s but a walking shadow, a poor player
That struts and frets his hour upon the stage
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.”
… Or, well, maybe that’s what the authors had in mind. Macbeth is not actually referenced in the original paper.
Stories and Mechanisms
I don’t know how plausible my story is, or the degree to which it’s consistent with the background stories of Ngo et al.
I think it is. I’ve told a story which illustrates how situationally aware policies learn to reward hack, become integrated into the human economy, and contribute to the emergence of a world driven by AIs each pursuing the broadly-scoped goal of ‘individual productivity’. The economic system, driven by AIs, maintains its power, and improves recursively. This power lies outside the hands of any AI Monarch, but maintains itself through the behavior of many diffuse AIs, largely interacting with one another.
My story may diverge from the authors’ visions of the future, and even I don’t want to commit to my story as an account of the modal future. But I’d have liked the paper to have elaborated on stories of this kind. There’s a reason papers in theoretical economics supplement models with simple illustrative vignettes. Stories serve as a way to motivate structural mappings between your model and the target system. Admittedly, I can understand the authors’ reluctance to introduce such fictions: if the editorial board found the original paper too speculative, they probably wouldn’t look too kindly on my story either. But, for all that, I think story construction should play an important argumentative role in accounts of AI doom.
Conclusions and Guts
I’ve detoured and criticized while discussing the original paper. But it’s a good paper, and I think it obviously merits serious academic discussion.
I do wish that the paper was more upfront about the theoretical bets it was making. I read the paper as primarily a theoretical paper,[3] and came away wishing for more explicit flagging of the underlying theories used to license generalizations from our current evidence.[4] The desire for more explicit theorizing is partly a matter of ordinary theoretical science. I want to know what predictions are being made, on the basis of what framework.
The desire for theory is also, partly, about getting our guts in check.
Joe discusses a couple of examples of his gut being out of step with conclusions he’d endorsed: about the simulation argument, about AI risk. But he needed to see something, up close, for his gut to really feel it. He treats this as a mistake. He asks himself what he would have predicted, last year, about seeing smart artificial agents in a vaguely-human simulated environment; or what he would’ve predicted about GPT-4’s capabilities. This may reveal a mistake, of sorts, but I think he misdiagnoses the mistake as primarily an error of Bayesian calibration. If it’s a mistake of calibration, it feels downstream of a mistake of imagination. Of a failure to richly picture the future you’re imagining, in all its messy complexity. The gut may be an unreliable narrator for the head, but, primarily, it’s a lazy narrator. It doesn’t fill in the details.
Returning to Ngo from Joe: our gut can listen to our head, but it struggles to properly hear it. There’s listening to someone’s words, and there’s making an attempt to inhabit the lifeworld they’re offering. If we’re to make arguments for AI risk – even academic ones – we should appeal to the gut, and help it out. We should give it the details required to help someone inhabit, fully, the stories of the future we take to be plausible. And, while I’ve detoured and nitpicked around Ngo’s paper, I can only do so because they’ve written an admirably clear piece on an important topic. Their piece deserves my nitpicking, and much more besides.
- ^
Enhanced capabilities could allow AIs to increase the rate of self-improvement; alternatively, it could be that, past a certain point, capability improvements hit diminishing returns. To claim that the former effect “likely” dominates the latter feels too strong, and their claim was made without an argument or reference (beyond the entirety of Bostrom’s Superintelligence) to back it up.
- ^
In comments on an earlier draft, Sören emphasized the importance of AIs ‘planning towards internally represented goals’ in the paper. While my own view is that representation isn’t really an internal matter, I think my new story captures the intuitions behind the core dynamics they’re worried about. AI firms reason as unified systems, and make foresighted plans to achieve some end-state over long time horizons. If humans have internally represented goals, I think the AI firms in my story do too.
- ^
I think there’s at least some weak evidence one author views this paper in this way.
- ^
This discussion also makes me wonder if my attempt to treat the piece as a theory paper might be wrong. Perhaps instead the paper is better construed as a policy or strategy paper, aiming to highlight potentially concerning dynamics, without attempting to push grander theoretical claims licensing more specific constraints over expected future outcomes. If I read the paper through a policy lens, I think it’s much more successful at achieving its core aims. I think the paper presents a good case for small-scale risks, which (we might speculate) could become much worse in the future.
Narrow comment:
I think “reward hacking” (I think “reinforcement attractors” would be a better name) is dominated by questions of exploration. If systems explore into reinforcement attractors, they will stay there. If it’s easier to explore into a “false-alarm” reinforcement event (e.g. claw in front of ball), then that will probably happen. Same for the classic boat racing example.
I’m definitely on board with reinforcement attractors being a real problem which has been really observed.
But I think there’s a very important caveat here, which is that these reinforcement attractors have to be explored into (unless the system already is cognizant of its reinforcement function, and is “trying” to make the reinforcement number go up in general). I worry that this exploration component is not recognized, or is acknowledged as an afterthought. But AFAICT exploration is actually the main driver of whether reinforcement attractors are realized.
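To put the exploration point in code, here is a rough toy sketch (my own, not taken from any cited work) in which whether the agent ever settles into the higher-reinforcement “false alarm” behavior hinges almost entirely on its exploration rate:

```python
# Toy two-armed bandit: a "legitimate" behavior with modest reinforcement, and a
# "false alarm" behavior (e.g. claw in front of camera) that yields higher measured
# reinforcement once discovered. All numbers are invented for illustration.
import random

def run(epsilon, steps=5000, seed=0):
    rng = random.Random(seed)
    q = [0.0, 0.0]          # value estimates: [legitimate, false-alarm]
    counts = [0, 0]
    rewards = [0.5, 1.0]    # the false-alarm behavior "looks" better to the reward signal
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(2)            # explore
        else:
            a = 0 if q[0] >= q[1] else 1    # exploit current estimate
        counts[a] += 1
        q[a] += (rewards[a] - q[a]) / counts[a]
    return counts[1] / steps  # fraction of time spent in the attractor

for eps in [0.0, 0.01, 0.1]:
    print("epsilon =", eps, "-> fraction in false-alarm attractor:", run(eps))
# With epsilon = 0 the agent never tries the false-alarm arm, so it never discovers
# the attractor; with even modest exploration it finds the attractor and stays there.
```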
A couple of questions wrt this:
Could LLMs develop the type of self-awareness you describe as part of their own training or RL-based fine-tuning? Many LLMs do seem to have “awareness” of their existence and function (incidentally, this could be evidenced by the model evals run by Anthropic). I assume a simple future setup could be auto-GPT-N with a prompt like “You are the CEO of Walmart; you want to make the company maximally profitable”. In that scenario, I would contend that the agent could easily be aware of both its role and function, and could easily be attracted to that search space.
Could we detect a deployed (and continually learning) agent entering these attractors? Personally, I would say that the more complex the plan being carried out, the harder it is for us to determine whether it is actually heading there (so we need supervision).
This seems to me very close to the core of Krueger et al.’s work in “Defining and Characterizing Reward Gaming”, and the solution of “stop before you encounter the attractors/hackable policy” seems hard to actually implement without some form of advanced supervision (which might get deceived), unless we find some broken scaling laws for this behavior.
I don’t count on myopic agents, which might be limited in their exploration, being where the economic incentive lies.
Assuming it’s LLMs all the way to AGI, would schemes like Constitutional AI/RLHF, applied during pre-training as well, be enough to constrain the model’s search space?
EDIT: aren’t we risking that all the tropes about evil AI act as an attractor for LLMs?