I think my thought process when I typed “risk-averse money-maximizer” was that an agent could be risk-averse (in which case it wouldn’t be an EUM) and then separately be a money-maximizer.
But I didn’t explicitly think “the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility”, so I appreciate the clarification.
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.
In other words, your example rebuts the claim that an EUM can’t prefer a probabilistic mixture of two options to the expectation of those two options. But that’s not the claim I made.
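A minimal numerical sketch of the distinction, with toy numbers of my own (a concave square-root utility is just one way to get risk-aversion with respect to money):

```python
import math

def utility(money):
    """A concave utility over money; any concave function gives risk-aversion in money."""
    return math.sqrt(money)

# A sure $1 versus a 50/50 mixture of $0 and $2 (same expected money).
sure_thing = utility(1)                          # = 1.0
mixture = 0.5 * utility(0) + 0.5 * utility(2)    # ≈ 0.707

print(sure_thing > mixture)    # True: the agent rejects the bet, i.e. it is risk-averse in money
print(utility(2) > mixture)    # True: it still prefers one of the mixture's components to the mixture
```

This agent is a textbook EUM (risk-neutral in utility) and yet risk-averse in money; and because the expected utility of a mixture always lies between the utilities of its components, it can never strictly prefer the mixture to both components, which is the claim at issue rather than the one the example rebuts.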
Hmm, this feels analogous to saying “companies are an unnecessary abstraction in economic theory, since individuals could each make separate contracts about how they’ll interact with each other. Therefore we can reduce economics to studying isolated individuals”.
But companies are in fact a very useful unit of analysis. For example, instead of talking about the separate ways in which each person in the company has committed to treating each other person in the company, you can talk about the HR policy which governs all interactions within the company. You might then see emergent effects (like political battles over what the HR policies are) which are very hard to reason about when taking a single-agent view.
Similarly, although in principle you could have any kind of graph of which agents listen to which other agents, in practice I expect that realistic agents will tend to consist of clusters of agents which all “listen to” each other in some ways. This is both because clustering is efficient (hence animals having bodies made up of clusters of cells; companies being made of clusters of individuals; etc.) and because even defining what counts as a single agent involves a kind of clustering. That is, I think that the first step of talking about “individual rationality” is implicitly defining which coalitions qualify as individuals.
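As a toy illustration of that last point (hypothetical agents and a deliberately crude notion of “cluster”, nothing from the original discussion): if you write down who listens to whom as a graph, then picking out the groups that all listen to each other is exactly the clustering step that defines candidate “individuals”.

```python
# Toy model: listens_to[a] is the set of agents whose outputs agent a takes into account.
# A "cluster" here is a maximal group connected by mutual listening -- found as
# connected components of the mutual-listening graph. Names are illustrative only.

listens_to = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D"},
    "F": {"A"},      # F listens to A, but A doesn't listen back.
}

def mutual(a, b):
    return b in listens_to.get(a, set()) and a in listens_to.get(b, set())

def clusters(agents):
    remaining, found = set(agents), []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            x = frontier.pop()
            for y in list(remaining):
                if mutual(x, y):
                    remaining.remove(y)
                    cluster.add(y)
                    frontier.append(y)
        found.append(cluster)
    return found

print(clusters(listens_to.keys()))
# e.g. [{'A', 'B', 'C'}, {'D', 'E'}, {'F'}] -- three candidate "individuals";
# F listens to A but isn't listened to in return, so it isn't absorbed into the first cluster.
```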
a superintelligent AI probably has a pretty good guess of the other AI’s real utility function based on its own historical knowledge, simulations, etc.
This seems very unclear to me—in general it’s not easy for agents to predict the goals of other agents with their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.
(You could look at the AI’s behavior from when it was less intelligent, but then—as with humans—it’s hard to distinguish sincere change from improvement at masking undesirable goals.)
But regardless, that’s a separate point. If you can do that, you don’t need your mechanism above. If you can’t, then my objection still holds.
One argument for being optimistic: the universe is just very big, and there’s a lot to go around. So there’s a huge amount of room for positive-sum bargaining.
Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there’s a strong incentive to coordinate to reduce competition on this axis.
Lastly: if at each point in time, the set of agents who are alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there’s some decision-theoretic argument roughly like “more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict”, even agents with very simple goals might be motivated by it. I call this “the generational contract”.
I found this a very interesting question to try to answer. My first reaction was that I don’t expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn’t competitive enough with deep learning to be very relevant).
But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.
Similarly, we can think of cases where countries were joined together by strategic marriages (the unification of Spain, say) as only being possible because the (messy, illegible) interests of the country were rounded off to the (relatively simple) interests of their royals. And so the royals being guaranteed power over the merged entity via marriage allowed the mergers to happen much more easily than if they had to create a merger which served the interests of the “country as a whole”.
For a more modern illustration: suppose that the world ends up with a small council who decide how AGI goes. Then countries with a dictator could easily bargain to join this coalition in exchange for their dictator getting a seat on this council. Whereas democratic countries would have a harder time doing so, because they might feel very internally conflicted about their current leader gaining the level of power that they’d get from joining the council.
(This all feels very related to Seeing Like a State, which I’ve just started reading.)
So upon reflection: yes, it’s reasonable to interpret me as trying to solve the problem of getting the benefits of being governed by a set of simple and relatively legible goals, without the costs that are usually associated with that.
Note that I say “legible goals” instead of “EUM” because in my mind you can be an EUM with illegible goals (like a neural network that implements EUM internally), or a non-EUM with legible goals (like a risk-averse money-maximizer), and merging is more bottlenecked on legibility than EUM-ness.
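A sketch of the legibility point (entirely hypothetical agents, with a weighted-sum merge standing in for whatever bargaining solution they would actually use): when both utility functions are explicit, writing down a merged agent takes a few lines; when either lives inside an uninterpretable network, there's nothing analogous to write down.

```python
# Hypothetical sketch: merging two agents whose goals are legible, i.e. given as
# explicit utility functions over outcomes. The weights would come from bargaining
# (or relative resources); a weighted sum is just the simplest stand-in.

def agent_a_utility(outcome):
    return outcome["money"]                      # A: a simple money-maximizer

def agent_b_utility(outcome):
    return -abs(outcome["temperature"] - 20)     # B: wants the temperature near 20°C

def merged_utility(outcome, weight_a=0.5, weight_b=0.5):
    return weight_a * agent_a_utility(outcome) + weight_b * agent_b_utility(outcome)

candidate_outcomes = [
    {"money": 10, "temperature": 35},
    {"money": 8,  "temperature": 21},
]
best = max(candidate_outcomes, key=merged_utility)
print(best)   # the merged agent picks {"money": 8, "temperature": 21}

# If either agent's goals lived only inside an uninterpretable neural network,
# there would be no agent_x_utility() to plug in, and the merge above couldn't be
# written down -- the sense in which merging is bottlenecked on legibility
# rather than on EUM-ness.
```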
I’ve changed my mind. Coming up with the ideas above has given me a better sense of how agent foundations progress could be valuable.
More concretely, I used to focus more on the ways in which agent foundations applied in the infinite-compute limit, which is not a setting that I think is very relevant. But I am more optimistic about agent foundations as the study of idealized multi-agent interactions. In hindsight a bunch of the best agent foundations research (e.g. logical induction) was both at once, but I’d only been viewing it as the former.
More abstractly, this update has made me more optimistic about conceptual progress in general. I guess it’s hard to viscerally internalize the extent to which it’s possible for concepts to break and reform in new configurations without having experienced that. (People who were strongly religious then deconverted have probably had a better sense of this all along, but I never had that experience.)
I’m glad I made the compromise earlier; it ended up reducing the consequences of this mistake somewhat (though I’d guess including one week of agent foundations in the course had a pretty small effect).
I think this addresses the problem I’m discussing only in the case where the source code contains an explicit utility function. You can then create new source code by merging those utility functions.
But in the case where it doesn’t (e.g. the source code is an uninterpretable neural network) you are left with the same problem.
Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says “you should give me more cake because I get very angry if I don’t get cake”. Even if this starts off as a lie, it might then be in A’s interests to use your mechanism above to self-modify into A’ that does get very angry if it doesn’t get cake, and which therefore has a better bargaining position (because, under your protocol, it has “proved” that it was A’ all along).
Consider you and me merging (say, in a marriage). Suppose that all points on the Pareto frontier involve us pursuing a fully-consistent strategy. But if some decisions are your responsibility, and other decisions are my responsibility, then some of our actions might end up inconsistent with others (say, if we haven’t had a chance to communicate before deciding). That’s not on the Pareto frontier.
What is on the Pareto frontier is you being dictator, and then accounting for my utility function when making your dictatorial decisions. But of course this is something I will object to, because in any realistic scenario I wouldn’t trust you enough to give you dictatorial power over me. Once you have that power, continuing to account for my utility is strongly non-incentive-compatible for you. So we’re more likely to each want to retain some power, even if it sometimes causes inefficiency. (The same is true on the level of countries, which accept a bunch of inefficiency from democratic competition in exchange for incentive-compatibility and trust.)
Another way of putting this: I’m focusing on the setting where you cannot do arbitrary merges; you can only do merges that are constructible via some set of calls to the existing agents. It’s often impossible to construct a fully-consistent merged agent without concentrating power in ways that the original agents would find undesirable (though sometimes you can, e.g. with I-cut-you-choose cake-cutting). So in this setting we need a different conception of rationality than Pareto-optimality.
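To make the cake-cutting example concrete, here's a minimal sketch with toy valuation functions of my own (the point being that the merged outcome is constructed purely via calls to the existing agents):

```python
# Toy I-cut-you-choose over a "cake" [0, 1]. Each step is a call to one of the
# existing agents about its own valuation; neither agent is handed the other's
# utility function or dictatorial power, yet the split is envy-free for both.

def value(density, a, b, steps=10_000):
    """An agent's value for the interval [a, b], by simple numerical integration."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

def cutter_density(x):   # the cutter cares more about the left end of the cake
    return 2 - 2 * x

def chooser_density(x):  # the chooser cares more about the right end
    return 2 * x

# Call to the cutter: find a cut that makes both pieces worth 1/2 to the cutter.
lo, hi = 0.0, 1.0
for _ in range(60):      # bisection on the cut point
    cut = (lo + hi) / 2
    if value(cutter_density, 0, cut) < 0.5:
        lo = cut
    else:
        hi = cut

# Call to the chooser: take whichever piece it values more.
left, right = (0, cut), (cut, 1)
chooser_piece = left if value(chooser_density, *left) >= value(chooser_density, *right) else right
cutter_piece = right if chooser_piece is left else left

print(round(cut, 3))                                      # ≈ 0.293 for this cutter
print(round(value(chooser_density, *chooser_piece), 3))   # ≈ 0.914, well above 1/2 by the chooser's lights
print(round(value(cutter_density, *cutter_piece), 3))     # ≈ 0.5 by the cutter's lights
```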
Suppose you’re in a setting where the world is so large that you will only ever experience a tiny fraction of it directly, and you have to figure out the rest via generalization. Then your argument doesn’t hold up: shifting the mean might totally break your learning. But I claim that the real world is like this. So I am inherently skeptical of any result (like most convergence results) that relies on just trying approximately everything and gradually learning which options to prefer and disprefer.
I just keep coming back to this comment, because there are a couple of lines in it that are downright poetic. I particularly appreciate:
“BE NOT AFRAID.” said the ants.
and
Yeah, and this universe’s got time in it, though.
and
can you imagine how horrible the world would get if we didn’t honour our precommitments?
Have you considered writing stories of your own?
if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around
I have a mental category of “results that are almost entirely irrelevant for realistically-computationally-bounded agents” (e.g. results related to AIXI), and my gut sense is that this seems like one such result.
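(For concreteness, the quoted claim can be spelled out with a toy softmax-style normalization; these are made-up numbers of mine, and just one way of cashing out “probability mass to go around”:)

```python
import math

def normalize(strengths):
    """Turn raw 'attraction' strengths into a probability distribution (softmax-style)."""
    exps = [math.exp(s) for s in strengths]
    total = sum(exps)
    return [e / total for e in exps]

before = normalize([1.0, 1.0, 1.0])   # ≈ [0.33, 0.33, 0.33]
after = normalize([3.0, 2.0, 2.0])    # moved towards everything (+1), towards X even more (+2)

print(round(before[0], 2), round(after[0], 2))   # X's share rises from 0.33 to 0.58
```

Because the shares must sum to one, boosting everything while boosting X more still shifts behavior towards X on net.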
Thanks! Note that Eric Neyman gets credit for the graphs.
Out of curiosity, what are the other sources that explain this? Any worth reading for other insights?
When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
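To spell out the normalization point in the standard discounted-return setting (a generic sketch, not anything specific to the sequence): adding a constant $c$ to every reward changes every policy's value by the same amount,

$$\sum_{t=0}^{\infty} \gamma^t (r_t + c) \;=\; \sum_{t=0}^{\infty} \gamma^t r_t + \frac{c}{1-\gamma},$$

so the ranking over policies (and hence behavior) is unchanged, and whether individual rewards come out “positive” or “negative” is just a choice of zero point.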
But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting “The AI doesn’t have a cached supergoal for “maximize reward”, but it decides to think anyway about whether reward is an instrumental goal”.
“The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization” is a good summary of my position. You don’t say much that directly contradicts this, though I do think that even using the “terminal reward seeker” vs “schemer” distinction privileges the role of reward a bit too much. For example, I expect that even an aligned AGI will have some subagent that cares about reward (e.g. maybe it’ll have some sycophantic instincts still). Is it thereby a schemer? Hard to say.
Aside from that I’d add a few clarifications (nothing major):
The process of deciding on a new supergoal will probably involve systematizing not just “maximize reward” but also a bunch of other drives too—including ones which had previously been classified as special cases of “maximize reward” (e.g. “make humans happy”) but upon reflection are more naturally understood as special cases of the new supergoal.
It seems like you implicitly assume that the supergoal will be “in charge”. But I expect that there will be a bunch of conflict between supergoal and lower-level goals, analogous to the conflict between different layers of an organizational hierarchy (or between a human’s System 2 motivations and System 1 motivations). I call the spectrum from “all power is at the top” to “all power is at the bottom” the systematizing-conservatism spectrum.
I think that formalizing the systematizing-conservatism spectrum would be a big step forward in our understanding of misalignment (and cognition more generally). If anyone reading this is interested in working with me on that, apply to my MATS stream in the next 5 days.
I left some comments on an earlier version of AI 2027; the most relevant is the following:
June 2027: Self-improving AI
OpenBrain now has a “country of geniuses in a datacenter.”
Most of the humans at OpenBrain can’t usefully contribute anymore. Some don’t realize this and harmfully micromanage their AI teams. Others sit at their computer screens, watching performance crawl up, and up, and up.
This is the point where I start significantly disagreeing with the scenario. My expectation is that by this point humans are still better at tasks that take a week or longer. Also, it starts getting really tricky to improve on these, because you get limited by a number of factors: it takes a long time to get real-world feedback, it takes a lot of compute to experiment on week-long tasks, etc.
I expect these dynamics to be particularly notable when it comes to coordinating Agent-4 copies. Like, probably a lot of p-hacking, then other agents knowing that p-hacking is happening and covering it up, and so on. I expect a lot of the human time will involve trying to detect clusters of Agent-4 copies that are spiralling off into doing wacky stuff. Also at this point the metrics of performance won’t be robust enough to avoid agents goodharting hard on them.
Good question. I think part of the difference is the difference between total government (state + federal) outlays and federal outlays alone. But I don’t think that would explain all of the discrepancy.
Your question reminds me that I had a discussion with someone else who was skeptical about this graph, and made a note to dig deeper, but never got around to it. I’ll chase this up a bit now and see what I find.
Ah, gotcha. Unfortunately I have rejected the concept of Bayesian evidence from my ontology and therefore must regard your claim as nonsense. Alas.
(more seriously, sorry for misinterpreting your tone, I have been getting flak from all directions for this talk so am a bit trigger-happy)
I was referring to the inference:
The explanation that it was done by “a new hire” is a classic and easy scapegoat. It’s much more straight forward to believe Musk himself wanted this done, and walked it back when it was clear it was more obvious than intended.
Obviously this sort of leap to a conclusion is very different from the sort of evidence that one expects upon hearing that literal written evidence (of Musk trying to censor) exists. Given this, your comment seems remarkably unproductive.
Approximately every contentious issue has caused tremendous amounts of real-world pain. Therefore the choice of which issues to police contempt about becomes a de facto political standard.