Richard_Ngo
Formerly an alignment and governance researcher at DeepMind and OpenAI. Now independent.
Have you read The Metamorphosis of Prime Intellect? Fits the bill.
Interesting. Got a short summary of what’s changing your mind?
I now have a better understanding of coalitional agency; I'll be interested in your thoughts on it when I write it up.
Our government is determined to lose the AI race in the name of winning the AI race.
The least we can do, if prioritizing winning the race, is to try and actually win it.
This is a bizarre pair of claims to make. But I think it illustrates a surprisingly common mistake from the AI safety community, which I call “jumping down the slippery slope”. More on this in a forthcoming blog post, but the key idea is that when you look at a situation from a high level of abstraction, it often seems inevitable that everyone will slide down a slippery slope towards a bad equilibrium. From that perspective, the sort of people who think in terms of high-level abstractions feel almost offended when others don’t slide down that slope. On a psychological level, the short-term benefit of “I get to tell them that my analysis is more correct than theirs” outweighs the long-term benefit of “people aren’t sliding down the slippery slope”.

One situation where I sometimes get this feeling is when a shopkeeper charges less than the market rate because they want to be kind to their customers. This is typically a redistribution of money from a wealthier person to less wealthy people, and either way it’s a virtuous thing to do. But I sometimes actually get annoyed at them, and itch to smugly say “listen, you dumbass, you just don’t understand economics”. It’s as if a part of me treats reaching the equilibrium as a goal in itself, whether or not we actually like that equilibrium.
This is obviously a much worse thing to do in AI safety. Relevant examples include Situational Awareness and safety-motivated capability evaluations (e.g. “building great capabilities evals is a thing the labs should obviously do, so our work on it isn’t harmful”). It feels like Zvi is doing this here too. Why is trying to actually win it the least we can do? Isn’t this exactly the opposite of what would promote crucial international cooperation on AI? Is it really so annoying when your opponents are shooting themselves in the foot that it’s worth advocating for them to stop doing that?
It kinda feels like the old joke:
On a beautiful Sunday afternoon in the midst of the French Revolution, the revolting citizens lead a priest, a drunkard and an engineer to the guillotine. They ask the priest if he wants to face up or down when he meets his fate. The priest says he would like to face up so he will be looking towards heaven when he dies. They raise the blade of the guillotine and release it. It comes speeding down and suddenly stops just inches from his neck. The authorities take this as divine intervention and release the priest.
The drunkard comes to the guillotine next. He also decides to die face up, hoping that he will be as fortunate as the priest. They raise the blade of the guillotine and release it. It comes speeding down and suddenly stops just inches from his neck. Again, the authorities take this as a sign of divine intervention, and they release the drunkard as well.
Next is the engineer. He, too, decides to die facing up. As they slowly raise the blade of the guillotine, the engineer suddenly says, “Hey, I see what your problem is …”
Approximately every contentious issue has caused tremendous amounts of real-world pain. Therefore the choice of which issues to police contempt about becomes a de facto political standard.
I think my thought process when I typed “risk-averse money-maximizer” was that an agent could be risk-averse (in which case it wouldn’t be an EUM) and then separately be a money-maximizer.
But I didn’t explicitly think “the risk-aversion would be with regard to utility, not money, and an agent that’s risk-averse with regard to money could still be risk-neutral with regard to utility”, so I appreciate the clarification.
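To make the distinction concrete, here's a minimal sketch (my own, not from the original exchange; the square-root utility is an arbitrary choice):

```python
# Toy sketch (mine, not from the original exchange): an EUM with a concave
# utility over money is risk-averse in money terms while being, by
# definition, risk-neutral in utility terms.
import math

def utility(money: float) -> float:
    return math.sqrt(money)  # concave, so the agent dislikes money-variance

# A 50/50 gamble between $0 and $100, versus a sure $50.
p = 0.5
expected_money_of_gamble = p * 0 + (1 - p) * 100                      # 50.0
expected_utility_of_gamble = p * utility(0) + (1 - p) * utility(100)  # 5.0
utility_of_sure_thing = utility(50)                                   # ~7.07

# The agent prefers the sure $50 even though the gamble has the same
# expected money: risk-averse with respect to money, yet still just
# maximizing expected utility, i.e. risk-neutral with respect to utility.
assert utility_of_sure_thing > expected_utility_of_gamble
```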
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.
In other words, your example rebuts the claim that an EUM can’t prefer a probabilistic mixture of two options to the expectation of those two options. But that’s not the claim I made.
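Spelling out the arithmetic behind the point above (my own gloss, not part of the original exchange):

```latex
% My own gloss on the point above, not part of the original exchange.
% For any EUM, preferring the better outcome over every mixture of the two
% outcomes follows immediately from linearity in probabilities:
\[
  \underbrace{p\,u(\$0) + (1-p)\,u(\$2)}_{\text{expected utility of the mixture}}
  \;\le\; u(\$2)
  \qquad \text{for all } p \in [0,1], \text{ whenever } u(\$2) \ge u(\$0).
\]
```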
Hmm, this feels analogous to saying “companies are an unnecessary abstraction in economic theory, since individuals could each make separate contracts about how they’ll interact with each other. Therefore we can reduce economics to studying isolated individuals”.
But companies are in fact a very useful unit of analysis. For example, instead of talking about the separate ways in which each person in the company has committed to treating each other person in the company, you can talk about the HR policy which governs all interactions within the company. You might then see emergent effects (like political battles over what the HR policies are) which are very hard to reason about when taking a single-agent view.
Similarly, although in principle you could have any kind of graph of which agents listen to which other agents, in practice I expect that realistic agents will tend to consist of clusters of agents which all “listen to” each other in some ways. This is both because clustering is efficient (hence animals having bodies made up of clusters of cells, companies being made of clusters of individuals, etc.) and because even when you define what counts as a single agent, you’re doing a kind of clustering. That is, I think that the first step of talking about “individual rationality” is implicitly defining which coalitions qualify as individuals.
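As a toy sketch of the clustering intuition (my own illustration; the agent names and the “listens to” relation are made up):

```python
# Toy sketch of the clustering intuition (my own illustration; the agent
# names and the "listens to" relation are made up).

# Directed "listens to" edges between agents.
listens_to = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2"},
    "a4": {"a5"},
    "a5": {"a4"},
    "a6": set(),  # listens to no one
}

def mutually_listening(x: str, y: str) -> bool:
    return y in listens_to[x] and x in listens_to[y]

# Greedily group agents that all listen to each other; each resulting
# cluster is a candidate "individual" for the purposes of rationality talk.
clusters: list[set[str]] = []
for agent in listens_to:
    for cluster in clusters:
        if all(mutually_listening(agent, member) for member in cluster):
            cluster.add(agent)
            break
    else:
        clusters.append({agent})

print(clusters)  # three clusters: {a1, a2, a3}, {a4, a5}, and {a6}
```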
a superintelligent AI probably has a pretty good guess of the other AI’s real utility function based on its own historical knowledge, simulations, etc.
This seems very unclear to me: in general it’s not easy for agents to predict the goals of other agents that are at their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.
(You could look at the AI’s behavior from when it was less intelligent, but then—as with humans—it’s hard to distinguish sincere change from improvement at masking undesirable goals.)
But regardless, that’s a separate point. If you can do that, you don’t need your mechanism above. If you can’t, then my objection still holds.
One argument for being optimistic: the universe is just very big, and there’s a lot to go around. So there’s a huge amount of room for positive-sum bargaining.
Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there’s a strong incentive to coordinate to reduce competition on this axis.
Lastly: if at each point in time, the set of agents who are alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there’s some decision-theoretic argument roughly like “more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict”, even agents with very simple goals might be motivated by it. I call this “the generational contract”.
I found this a very interesting question to try to answer. My first reaction was that I don’t expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn’t competitive enough with deep learning to be very relevant).
But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.
Similarly, we can think of cases where countries were joined together by strategic marriages (the unification of Spain, say) as only being possible because the (messy, illegible) interests of each country were rounded off to the (relatively simple) interests of its royals. And so the royals being guaranteed power over the merged entity via marriage allowed the mergers to happen much more easily than if a merger had to be constructed which served the interests of the “country as a whole”.
For a more modern illustration: suppose that the world ends up with a small council who decide how AGI goes. Then countries with a dictator could easily bargain to join this coalition in exchange for their dictator getting a seat on this council. Whereas democratic countries would have a harder time doing so, because they might feel very internally conflicted about their current leader gaining the level of power that they’d get from joining the council.
(This all feels very related to Seeing Like a State, which I’ve just started reading.)
So upon reflection: yes, it’s reasonable to interpret me as trying to solve the problem of getting the benefits of being governed by a set of simple and relatively legible goals, without the costs that are usually associated with that.
Note that I say “legible goals” instead of “EUM” because in my mind you can be an EUM with illegible goals (like a neural network that implements EUM internally), or a non-EUM with legible goals (like a risk-averse money-maximizer), and merging is more bottlenecked on legibility than EUM-ness.
I’ve changed my mind. Coming up with the ideas above has given me a better sense of how agent foundations progress could be valuable.
More concretely, I used to focus more on the ways in which agent foundations applied in the infinite-compute limit, which is not a setting that I think is very relevant. But I am more optimistic about agent foundations as the study of idealized multi-agent interactions. In hindsight a bunch of the best agent foundations research (e.g. logical induction) was both at once, but I’d only been viewing it as the former.
More abstractly, this update has made me more optimistic about conceptual progress in general. I guess it’s hard to viscerally internalize the extent to which it’s possible for concepts to break and re-form into new configurations without having experienced that. (People who were strongly religious and then deconverted have probably had a better sense of this all along, but I never had that experience.)
I’m glad I made the compromise earlier; it ended up reducing the consequences of this mistake somewhat (though I’d guess that including one week of agent foundations in the course had a pretty small effect).
I think this addresses the problem I’m discussing only in the case where the source code contains an explicit utility function. You can then create new source code by merging those utility functions.
But in the case where it doesn’t (e.g. the source code is an uninterpretable neural network) you are left with the same problem.
Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says “you should give me more cake because I get very angry if I don’t get cake”. Even if this starts off as a lie, it might then be in A’s interests to use your mechanism above to self-modify into A’ that does get very angry if it doesn’t get cake, and which therefore has a better bargaining position (because, under your protocol, it has “proved” that it was A’ all along).
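For concreteness, here's a toy sketch of the explicit-utility-function case from the first paragraph (my own illustration; the 50/50 weighting is an arbitrary assumption, not a real merging protocol):

```python
# Toy sketch of the explicit-utility-function case (my own illustration;
# the 50/50 weighting is an arbitrary assumption, not a real protocol).
from typing import Callable

Outcome = str
UtilityFn = Callable[[Outcome], float]

def merge_utilities(u_a: UtilityFn, u_b: UtilityFn, weight_a: float) -> UtilityFn:
    """Source code for the merged agent: a fixed weighted sum of the two
    parents' explicit utility functions."""
    return lambda outcome: weight_a * u_a(outcome) + (1 - weight_a) * u_b(outcome)

# Two agents with legible, explicit utility functions over outcomes.
u_a: UtilityFn = lambda o: {"cake_to_A": 1.0, "cake_to_B": 0.0}.get(o, 0.0)
u_b: UtilityFn = lambda o: {"cake_to_A": 0.0, "cake_to_B": 1.0}.get(o, 0.0)

merged = merge_utilities(u_a, u_b, weight_a=0.5)
print(merged("cake_to_A"), merged("cake_to_B"))  # 0.5 0.5

# When the "utility function" is an uninterpretable neural network, there is
# no analogous line to write here -- which is the problem discussed above.
```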
Consider you and me merging (say, in a marriage). Suppose that all points on the Pareto frontier involve us pursuing a fully consistent strategy. But if some decisions are your responsibility and other decisions are mine, then some of our actions might end up inconsistent with others (say, if we haven’t had a chance to communicate before deciding). That’s not on the Pareto frontier.
What is on the Pareto frontier is you being dictator, and then accounting for my utility function when making your dictatorial decisions. But of course this is something I will object to, because in any realistic scenario I wouldn’t trust you enough to give you dictatorial power over me. Once you have that power, continuing to account for my utility is strongly non-incentive-compatible for you. So we’re more likely to each want to retain some power, even if it sometimes causes inefficiency. (The same is true on the level of countries, which accept a bunch of inefficiency from democratic competition in exchange for incentive-compatibility and trust.)
Another way of putting this: I’m focusing on the setting where you cannot do arbitrary merges; you can only do merges that are constructible via some set of calls to the existing agents. It’s often impossible to construct a fully consistent merged agent without concentrating power in ways that the original agents would find undesirable (though sometimes it is possible, e.g. with I-cut-you-choose cake-cutting). So in this setting we need a different conception of rationality than Pareto-optimality.
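Here's a toy sketch of the I-cut-you-choose example, framed as a merge constructed purely from calls to the existing agents (my own illustration; the valuation functions are hypothetical):

```python
# Toy sketch of I-cut-you-choose as a merge built purely from calls to the
# existing agents (my own illustration; the valuation functions are made up).
from typing import Callable

# Each agent only answers two kinds of calls: "what is the piece [0, x] of a
# unit cake worth to you?" and "which piece do you pick?"
Valuation = Callable[[float], float]

def cut(cutter_value: Valuation, precision: int = 10_000) -> float:
    """The cutter picks the cut point that makes the two pieces equal by
    their own valuation, since the chooser takes whichever piece they prefer."""
    return min(
        (i / precision for i in range(1, precision)),
        key=lambda x: abs(cutter_value(x) - (cutter_value(1.0) - cutter_value(x))),
    )

def choose(chooser_value: Valuation, cut_point: float) -> str:
    left = chooser_value(cut_point)
    right = chooser_value(1.0) - chooser_value(cut_point)
    return "left" if left >= right else "right"

# Hypothetical valuations: A values cake uniformly, B only values the right half.
value_a: Valuation = lambda x: x
value_b: Valuation = lambda x: max(0.0, x - 0.5) * 2

cut_point = cut(value_a)
print(cut_point, choose(value_b, cut_point))  # 0.5 right

# The "merged agent" is just this procedure: it never needs either agent's
# full utility function, and it never hands either agent dictatorial power.
```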
Suppose you’re in a setting where the world is so large that you will only ever experience a tiny fraction of it directly, and you have to figure out the rest via generalization. Then your argument doesn’t hold up: shifting the mean might totally break your learning. But I claim that the real world is like this. So I am inherently skeptical of any result (like most convergence results) that relies on just trying approximately everything and gradually learning which to prefer and which to disprefer.
I just keep coming back to this comment, because there are a couple of lines in it that are downright poetic. I particularly appreciate:
“BE NOT AFRAID.” said the ants.
and
Yeah, and this universe’s got time in it, though.
and
can you imagine how horrible the world would get if we didn’t honour our precommitments?
Have you considered writing stories of your own?
if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around
I have a mental category of “results that are almost entirely irrelevant for realistically-computationally-bounded agents” (e.g. results related to AIXI), and my gut sense is that this seems like one such result.
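(For readers who want the quoted argument spelled out numerically, here's a toy illustration with made-up scores:)

```python
# Toy numbers for the quoted argument (my own illustration): boost every
# option's score, but boost X the most; after renormalizing, X takes a
# larger share of the limited probability mass.
import math

def softmax(scores: dict[str, float]) -> dict[str, float]:
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}

before = softmax({"X": 1.0, "Y": 1.0, "Z": 1.0})
after = softmax({"X": 1.0 + 2.0, "Y": 1.0 + 0.5, "Z": 1.0 + 0.5})

print(round(before["X"], 2), round(after["X"], 2))  # 0.33 -> 0.69
```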
Thanks! Note that Eric Neyman gets credit for the graphs.
Out of curiosity, what are the other sources that explain this? Any worth reading for other insights?
When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
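A minimal way to see the normalization point, as my own gloss rather than anything from the sequence:

```latex
% My own gloss on the normalization point, not something from the sequence.
% Shifting a reward function by any constant c leaves the induced ranking of
% policies unchanged over a fixed horizon T:
\[
  \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T}\big(r(s_t) + c\big)\right]
  \;=\;
  \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} r(s_t)\right] + cT ,
\]
% so whether the rewards happen to be written as positive ("move towards")
% or negative ("move away from") numbers carries no extra information.
```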
Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting “The AI doesn’t have a cached supergoal for “maximize reward”, but it decides to think anyway about whether reward is an instrumental goal”.
“The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization” is a good summary of my position. You don’t say much that directly contradicts this, though I do think that even using the “terminal reward seeker” vs “schemer” distinction privileges the role of reward a bit too much. For example, I expect that even an aligned AGI will have some subagent that cares about reward (e.g. maybe it’ll have some sycophantic instincts still). Is it thereby a schemer? Hard to say.
Aside from that I’d add a few clarifications (nothing major):
The process of deciding on a new supergoal will probably involve systematizing not just “maximize reward” but also a bunch of other drives, including ones which had previously been classified as special cases of “maximize reward” (e.g. “make humans happy”) but which upon reflection are more naturally understood as special cases of the new supergoal.
It seems like you implicitly assume that the supergoal will be “in charge”. But I expect that there will be a bunch of conflict between supergoal and lower-level goals, analogous to the conflict between different layers of an organizational hierarchy (or between a human’s System 2 motivations and System 1 motivations). I call the spectrum from “all power is at the top” to “all power is at the bottom” the systematizing-conservatism spectrum.
I think that formalizing the systematizing-conservatism spectrum would be a big step forward in our understanding of misalignment (and cognition more generally). If anyone reading this is interested in working with me on that, apply to my MATS stream in the next 5 days.
Yepp, see also some of my speculations here: https://x.com/richardmcngo/status/1815115538059894803?s=46