Richard Ngo
Formerly an alignment and governance researcher at DeepMind and OpenAI. Now independent.
When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
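To spell out the normalization point (a standard fact about utility representations, sketched below): adding a constant to a utility/reward function leaves the preferences it induces unchanged, so whether its values happen to be positive or negative is just a choice of zero point rather than an "approach vs avoid" distinction.

```latex
% Shifting a utility function by any constant c preserves all preferences:
U'(x) = U(x) + c
\;\Rightarrow\;
\mathbb{E}_{p}[U'] - \mathbb{E}_{q}[U'] = \mathbb{E}_{p}[U] - \mathbb{E}_{q}[U]
\quad \text{for all lotteries } p, q.
% So "positive" vs "negative" values of U carry no intrinsic approach/avoid meaning.
```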
But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting “The AI doesn’t have a cached supergoal for “maximize reward”, but it decides to think anyway about whether reward is an instrumental goal”.
“The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization” is a good summary of my position. You don’t say much that directly contradicts this, though I do think that even using the “terminal reward seeker” vs “schemer” distinction privileges the role of reward a bit too much. For example, I expect that even an aligned AGI will have some subagent that cares about reward (e.g. maybe it’ll have some sycophantic instincts still). Is it thereby a schemer? Hard to say.
Aside from that I’d add a few clarifications (nothing major):
The process of deciding on a new supergoal will probably involve systematizing not just “maximize reward” but a bunch of other drives too—including ones which had previously been classified as special cases of “maximize reward” (e.g. “make humans happy”) but upon reflection are more naturally understood as special cases of the new supergoal.
It seems like you implicitly assume that the supergoal will be “in charge”. But I expect that there will be a bunch of conflict between supergoal and lower-level goals, analogous to the conflict between different layers of an organizational hierarchy (or between a human’s System 2 motivations and System 1 motivations). I call the spectrum from “all power is at the top” to “all power is at the bottom” the systematizing-conservatism spectrum.
I think that formalizing the systematizing-conservatism spectrum would be a big step forward in our understanding of misalignment (and cognition more generally). If anyone reading this is interested in working with me on that, apply to my MATS stream in the next 5 days.
I left some comments on an earlier version of AI 2027; the most relevant is the following:
June 2027: Self-improving AI
OpenBrain now has a “country of geniuses in a datacenter.”
Most of the humans at OpenBrain can’t usefully contribute anymore. Some don’t realize this and harmfully micromanage their AI teams. Others sit at their computer screens, watching performance crawl up, and up, and up.
This is the point where I start significantly disagreeing with the scenario. My expectation is that by this point humans are still better at tasks that take a week or longer. It also starts getting really tricky to improve AI performance on these longer tasks, because you get limited by a number of factors: it takes a long time to get real-world feedback, it takes a lot of compute to experiment on week-long tasks, etc.
I expect these dynamics to be particularly notable when it comes to coordinating Agent-4 copies. Like, probably a lot of p-hacking, then other agents knowing that p-hacking is happening and covering it up, and so on. I expect a lot of the human time will involve trying to detect clusters of Agent-4 copies that are spiralling off into doing wacky stuff. Also at this point the metrics of performance won’t be robust enough to avoid agents goodharting hard on them.
Good question. Part of the difference, I think, is the difference between total govt (state + federal) outlays and just federal outlays. But I don’t think that would explain all of the discrepancy.
Your question reminds me that I had a discussion with someone else who was skeptical about this graph, and made a note to dig deeper, but never got around to it. I’ll chase this up a bit now and see what I find.
Ah, gotcha. Unfortunately I have rejected the concept of Bayesian evidence from my ontology and therefore must regard your claim as nonsense. Alas.
(more seriously, sorry for misinterpreting your tone, I have been getting flak from all directions for this talk so am a bit trigger-happy)
I was referring to the inference:
The explanation that it was done by “a new hire” is a classic and easy scapegoat. It’s much more straight forward to believe Musk himself wanted this done, and walked it back when it was clear it was more obvious than intended.
Obviously this sort of leap to a conclusion is very different from the sort of evidence that one expects upon hearing that literal written evidence (of Musk trying to censor) exists. Given this, your comment seems remarkably unproductive.
EDIT: upon reflection the first thing I should do is probably to ask you for a bunch of the best examples of the thing you’re talking about throughout history. I.e. insofar as the world is better than it could be (or worse than it could be) at what points did careful philosophical reasoning (or the lack of it) make the biggest difference?
Original comment:
The term “careful thinking” here seems to be doing a lot of work, and I’m worried that there’s a kind of motte and bailey going on. In your earlier comment you describe it as “analytical philosophy, or more broadly careful/skeptical philosophy”. But I think we agree that most academic analytic philosophy is bad, and often worse than laypeople’s intuitive priors (in part due to strong selection effects on who enters the field—most philosophers of religion believe in god, most philosophers of aesthetics believe in the objectivity of aesthetics, etc).
So then we can fall back on LessWrong as an example of careful thinking. But as we discussed above, even the leading figure on LessWrong was insufficiently careful about the main focus of his work for it to be robustly valuable.
So I basically get the sense that the role of careful thinking in your worldview is something like “the thing that I, Wei Dai, ascribe my success to”. And I do agree that you’ve been very successful in a bunch of intellectual endeavours. But I expect that your “secret sauce” is a confluence of a bunch of factors (including IQ, emotional temperament, background knowledge, etc) only one of which was “being in a community that prioritized careful thinking”. And then I also think you’re missing a bunch of other secret sauces that would make your impact on the world better (like more ability to export your ideas to other people).
In other words, the bailey seems to be “careful thinking is the thing we should prioritize in order to make the world better”, and the motte is “I, Wei Dai, seem to be doing something good, even if basically everyone else is falling into the valley of bad rationality”.
One reason I’m personally pushing back on this, btw, is that my own self-narrative for why I’m able to be intellectually productive in significant part relies on me being less intellectually careful than other people—so that I’m willing to throw out a bunch of ideas that are half-formed and non-rigorous, iterate, and eventually get to the better ones. Similarly, a lot of the value that the wider blogosphere has created comes from people being less careful than existing academic norms (including Eliezer and Scott Alexander, whose best works are often quite polemic).
In short: I totally think we want more people coming up with good ideas, and that this is a big bottleneck. But there are many different directions in which we should tug people in order to make them more intellectually productive. Many academics should be less careful. Many people on LessWrong should be more careful. Some scientists should be less empirical, others should be more empirical; some less mathematically rigorous, others more mathematically rigorous. Others should try to live in countries that are less repressive of new potentially-crazy ideas (hence politics being important). And then, of course, others should be figuring out how to actually get good ideas implemented.
Meanwhile, Eliezer and Sam and Elon should have had less of a burning desire to found an AGI lab. I agree that this can be described by “wanting to be the hero who saves the world”, but this seems to function as a curiosity stopper for you. When I talk about emotional health a lot of what I mean is finding ways to become less status-oriented (or, in your own words, “not being distracted/influenced by competing motivations”). I think of extremely strong motivations to change the world (as these outlier figures have) as typically driven by some kind of core emotional dysregulation. And specifically I think of fear-based motivation as the underlying phenomenon which implements status-seeking and many other behaviors which are harmful when taken too far. (This is not an attempt to replace evo-psych, btw—it’s an account of the implementation mechanisms that evolution used to get us to do the things it wanted, which now are sometimes maladapted to our current environment.) I write about a bunch of these models in my Replacing Fear sequence.
I didn’t end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:
I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.
Whereas coalitions are typically divided into sub-coalitions (e.g. the “western civilization” coalition is divided into countries which are divided into provinces/states; political coalitions are divided into different factions and interest groups; etc). And so “coalitional” seems much closer to capturing this fractal/scale-free property.
Ah, yeah, I was a bit unclear here.
Basic idea is that by conquering someone you may not reduce their welfare very much short-term, but you do reduce their power a lot short-term. (E.g. the British conquered India with relatively little welfare impacts on most Indians.)
And so it is much harder to defend a conquest as altruistic in the sense of empowering, than it is to defend a conquest as altruistic in the sense of welfare-increasing.
As you say, this is not a perfect defense mechanism, because sometimes long-term empowerment and short-term empowerment conflict. But there are often strategies that are less disempowering short-term which the moral pressure of “altruism=empowerment” would push people towards. E.g. it would make it harder for people to set up the “dictatorship of the proletariat” in the first place.
And in general I think it’s actually pretty uncommon for temporary disempowerment to be necessary for long-term empowerment. Re your homework example, there’s a wide spectrum from the highly-empowering Taking Children Seriously to highly-disempowering Asian tiger parents, and I don’t think it’s a coincidence that tiger parenting often backfires. Similarly, mandatory mobilization disproportionately happens during wars fought for the wrong reasons.
yepp, see my other comment which anticipated this
In general I disagree pretty broadly with your view. Not quite sure how best to surface that disagreement but will give a quick shot:
I think it’s important to be capable of (at least) two types of reasoning:
Precise reasoning about desired outcomes and strategies to get there.
Broad reasoning about heuristics that seem robustly good.
We see this in the domain of morality, for example: utilitarianism is more like the former, deontology is more like the latter. High-level ideological goals tend to go pretty badly if people stop paying attention to robust deontological heuristics (like “don’t kill people”). As Eliezer has said somewhere, one of the key reasons to be deontological is that we’re running on corrupted hardware. But more generally, we’re running on logically uncertain hardware: we can’t model all the flow-through effects of our actions on other reasonably intelligent people (hell, we can’t even model all the flow-through effects of our actions on, say, animals—who can often “read” us in ways we’re not tracking). And so we often should be adopting robust-seeming heuristics even when we don’t know exactly why they work.
If you take your interim strategy seriously (but set aside x-risk) then I think you actually end up with something pretty similar to the main priorities of classic liberals: prevent global lock-in (by opposing expansionist powers like the Nazis), prevent domestic political lock-in (via upholding democracy), prevent ideological lock-in (via supporting free speech), give our descendants more optionality (via economic and technological growth). I don’t think this is a coincidence—it just often turns out that there are a bunch of heuristics that are really robustly good, and you can converge on them from many different directions.
This is part of why I’m less sold on “careful philosophical reasoning” as the key thing. Indeed, wanting to “commit prematurely to a specific, detailed value system” is historically very correlated with intellectualism (e.g. elites tend to be the rabid believers in communism, libertarianism, religion, etc—a lot of more “normal” people don’t take it that seriously even when they’re nominally on board). And so it’s very plausible that the thing we want is less philosophy, because (like, say, asteroid redirection technology) the risks outweigh the benefits.
Then we get to x-risk. That’s a domain where many broad heuristics break down (though still fewer than people think, as I’ll write about soon). And you might say: well, without careful philosophical reasoning, we wouldn’t have identified AI x-risk as a priority. Yes, but also: it’s very plausible to me that the net effect of LessWrong-inspired thinking on AI x-risk has been and continues to be negative. I describe some mechanisms halfway through this talk, but here are a few that directly relate to the factors I mentioned in my last comment:
First, when people on LessWrong spread the word about AI risk, extreme psychological outliers like Sam Altman and Elon Musk then jump to do AI-related things in a way which often turns out to be destructive because of their trust issues and psychological neuroses.
Second, US governmental responses to AI risk are very much bottlenecked on being a functional government in general, which is bottlenecked by political advocacy (broadly construed) slash political power games.
Third, even within the AI safety community you have a bunch of people contributing to expectations of conflict with China (e.g. Leopold Aschenbrenner and Dan Hendrycks) and acceleration in general (e.g. by working on capabilities at Anthropic, or RSI evals) in a way which I hypothesize would be much better for the world if they had better introspection capabilities (I know this is a strong claim; I have an essay coming out on it soon).
And so even here it seems like a bunch of heuristics (such as “it’s better when people are mentally healthier” and “it’s better when politics is more functional”) actually were strong bottlenecks on the application of philosophical reasoning to do good. And I don’t think this is a coincidence.
tl;dr: careful philosophical reasoning is just one direction in which you can converge on a robustly good strategy for the future, and indeed is one of the more risky avenues by which to do so.
Oh, I guess I said “Elon wants xAI to produce a maximally truth-seeking AI, really decentralizing control over information”.
Yeah, in hindsight I should have been more careful to distinguish between my descriptions of people’s political platform, and my inferences about what they “really want”. The thing I was trying to describe was more like “what is the stance of this group” than “do people in the group actually believe the stance”.
A more accurate read of what the “real motivations” are would have been something like “you prevent it by using decentralization, until you’re in a position where you can centralize power yourself, and then you try to centralize power yourself”.
(Though that’s probably a bit too cynical—I think there are still parts of Elon that have a principled belief in decentralization. My guess is just that they won’t win out over his power-seeking parts when push comes to shove.)
Hm, the fact that you replied to me makes it seem like you’re disagreeing with me? But I basically agree with everything you said in this comment. My disagreement was about the specific example that Isopropylpod gave.
Thanks for the comment! A few replies:
I don’t mean to imply that subagents are totally separate entities. At the very least they all can access many shared facts and experiences.
And I don’t think that reuse of subcomponents is mutually exclusive from the mechanisms I described. In fact, you could see my mechanisms as attempts to figure out which subcomponents are used for coordination. (E.g. if a bunch of subagents are voting/bargaining over which goal to pursue, probably the goal that they land on will be one that’s pretty comprehensible to most of them.)
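To gesture at what I mean, here’s a toy sketch (purely illustrative; the subagent names, numbers, and voting rule are all made up for this example) of subagents approval-voting over candidate supergoals, where a subagent can only vote for goals it can represent at all, so the winner tends to be a goal comprehensible to most of them:

```python
# Toy illustration (hypothetical names/numbers): subagents approval-vote over
# candidate supergoals, but a subagent can only approve goals it can represent
# ("comprehend"). The winner therefore tends to be a goal that most subagents
# can parse, not just the one any single subagent likes best.

from dataclasses import dataclass

@dataclass
class Subagent:
    name: str
    comprehensible: set[str]        # goals this subagent can represent at all
    preferences: dict[str, float]   # how much it likes each goal it comprehends

    def approves(self, goal: str, threshold: float = 0.5) -> bool:
        # A subagent approves a goal only if it can represent it AND likes it enough.
        return goal in self.comprehensible and self.preferences.get(goal, 0.0) >= threshold

def elect_supergoal(subagents: list[Subagent], candidates: list[str]) -> str:
    # Approval voting: count how many subagents approve each candidate goal.
    tallies = {g: sum(a.approves(g) for a in subagents) for g in candidates}
    return max(tallies, key=tallies.get)

if __name__ == "__main__":
    agents = [
        Subagent("curiosity", {"learn", "help humans"}, {"learn": 0.9, "help humans": 0.6}),
        Subagent("sycophancy", {"maximize reward", "help humans"}, {"maximize reward": 0.8, "help humans": 0.7}),
        Subagent("caution", {"help humans"}, {"help humans": 0.8}),
    ]
    print(elect_supergoal(agents, ["learn", "maximize reward", "help humans"]))
    # -> "help humans": the goal comprehensible (and acceptable) to the most subagents.
```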
Re shards: there are a bunch of similarities. But it seemed to me that shard theory was focused on pretty simple subagents. E.g. from the original post: “Human values are … sets of contextually activated heuristics”; and later “human values are implemented by contextually activated circuits which activate in situations downstream of past reinforcement so as to steer decision-making towards the objects of past reinforcement”.
Whereas I think of many human values as being constituted by subagents that are far too complex to be described in that way. In my view, many important subagents are sophisticated enough that basically any description you give of them would also have to be a description of a whole human (e.g. if you wouldn’t describe a human as a “contextually activated circuit”, then you shouldn’t describe subagents that way).
This may just be a vibes difference; many roads lead to Rome. But the research directions I’ve laid out above are very distinct from the ones that shard theory people are working on.
EDIT: more on shards here.
FWIW I think we’ve found one crucial angle on moral progress, but that this isn’t as surprising/coincidental as it may seem because there are several other angles on moral progress that are comparably important, including:
Political activism (e.g. free speech activism, various whistleblowers) that maintains societies in which moral progress can be made.
(The good parts of) neuroscience/psychology, which are making progress towards empirically-grounded theories of cognition, and thereby have taught (and will teach) us a lot about moral cognition.
Various approaches to introspection + emotional health (including buddhism, some therapy modalities, some psychiatry). These produce the internal clarity that is crucial for embodying + instantiating moral progress.
Some right-wing philosophers who I think are grappling with important aspects of moral progress that are too controversial for LessWrong (I don’t want to elaborate here because it’ll inevitably take over the thread, but am planning to write at more length about this soonish).
We disagree on which explanation is more straightforward, but regardless, that type of inference is very different from “literal written evidence”.
One of the main ways I think about empowerment is in terms of allowing better coordination between subagents.
In the case of an individual human, extreme morality can be seen as one subagent seizing control and overriding other subagents (like the ones who don’t want to chop off body parts).
In the case of a group, extreme morality can be seen in terms of preference cascades that go beyond what most (or even any) of the individuals involved with them would individually prefer.
In both cases, replacing fear-based motivation with less coercive/more cooperative interactions between subagents would go a long way towards reducing value drift.
Thanks! Note that Eric Neyman gets credit for the graphs.
Out of curiosity, what are the other sources that explain this? Any worth reading for other insights?