Richard_Ngo: Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
I didn’t end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:
I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.
Whereas coalitions are typically divided into sub-coalitions (e.g. the “western civilization” coalition is divided into countries which are divided into provinces/states; political coalitions are divided into different factions and interest groups; etc). And so “coalitional” seems much closer to capturing this fractal/scale-free property.
Ah, yeah, I was a bit unclear here.
Basic idea is that by conquering someone you may not reduce their welfare very much short-term, but you do reduce their power a lot short-term. (E.g. the British conquered India with relatively little welfare impact on most Indians.)
And so it is much harder to defend a conquest as altruistic in the sense of empowering, than it is to defend a conquest as altruistic in the sense of welfare-increasing.
As you say, this is not a perfect defense mechanism, because sometimes long-term empowerment and short-term empowerment conflict. But there are often strategies that are less disempowering short-term which the moral pressure of “altruism=empowerment” would push people towards. E.g. it would make it harder for people to set up the “dictatorship of the proletariat” in the first place.
And in general I think it’s actually pretty uncommon for temporary disempowerment to be necessary for long-term empowerment. Re your homework example, there’s a wide spectrum from the highly-empowering Taking Children Seriously to highly-disempowering Asian tiger parents, and I don’t think it’s a coincidence that tiger parenting often backfires. Similarly, mandatory mobilization disproportionately happens during wars fought for the wrong reasons.
yepp, see my other comment which anticipated this
In general I disagree pretty broadly with your view. Not quite sure how best to surface that disagreement but will give a quick shot:
I think it’s important to be capable of (at least) two types of reasoning:
Precise reasoning about desired outcomes and strategies to get there.
Broad reasoning about heuristics that seem robustly good.
We see this in the domain of morality, for example: utilitarianism is more like the former, deontology is more like the latter. High-level ideological goals tend to go pretty badly if people stop paying attention to robust deontological heuristics (like “don’t kill people”). As Eliezer has said somewhere, one of the key reasons to be deontological is that we’re running on corrupted hardware. But more generally, we’re running on logically uncertain hardware: we can’t model all the flow-through effects of our actions on other reasonably intelligent people (hell, we can’t even model all the flow-through effects of our actions on, say, animals—who can often “read” us in ways we’re not tracking). And so we often should be adopting robust-seeming heuristics even when we don’t know exactly why they work.
If you take your interim strategy seriously (but set aside x-risk) then I think you actually end up with something pretty similar to the main priorities of classical liberals: prevent global lock-in (by opposing expansionist powers like the Nazis), prevent domestic political lock-in (via upholding democracy), prevent ideological lock-in (via supporting free speech), give our descendants more optionality (via economic and technological growth). I don’t think this is a coincidence—it just often turns out that there are a bunch of heuristics that are really robustly good, and you can converge on them from many different directions.
This is part of why I’m less sold on “careful philosophical reasoning” as the key thing. Indeed, wanting to “commit prematurely to a specific, detailed value system” is historically very correlated with intellectualism (e.g. elites tend to be the rabid believers in communism, libertarianism, religion, etc—a lot of more “normal” people don’t take it that seriously even when they’re nominally on board). And so it’s very plausible that the thing we want is less philosophy, because (like, say, asteroid redirection technology) the risks outweigh the benefits.
Then we get to x-risk. That’s a domain where many broad heuristics break down (though still fewer than people think, as I’ll write about soon). And you might say: well, without careful philosophical reasoning, we wouldn’t have identified AI x-risk as a priority. Yes, but also: it’s very plausible to me that the net effect of LessWrong-inspired thinking on AI x-risk has been and continues to be negative. I describe some mechanisms halfway through this talk, but here are a few that directly relate to the factors I mentioned in my last comment:
First, when people on LessWrong spread the word about AI risk, extreme psychological outliers like Sam Altman and Elon Musk then jump to do AI-related things in a way which often turns out to be destructive because of their trust issues and psychological neuroses.
Second, US governmental responses to AI risk are very much bottlenecked on being a functional government in general, which is bottlenecked by political advocacy (broadly construed) slash political power games.
Third, even within the AI safety community you have a bunch of people contributing to expectations of conflict with China (e.g. Leopold Aschenbrenner and Dan Hendrycks) and to acceleration in general (e.g. by working on capabilities at Anthropic, or on RSI evals). I hypothesize that the world would be much better off if these people had better introspection capabilities (I know this is a strong claim; I have an essay coming out on it soon).
And so even here it seems like a bunch of heuristics (such as “it’s better when people are mentally healthier” and “it’s better when politics is more functional”) actually were strong bottlenecks on the application of philosophical reasoning to do good. And I don’t think this is a coincidence.
tl;dr: careful philosophical reasoning is just one direction in which you can converge on a robustly good strategy for the future, and indeed is one of the more risky avenues by which to do so.
Oh, I guess I said “Elon wants xAI to produce a maximally truth-seeking AI, really decentralizing control over information”.
Yeah, in hindsight I should have been more careful to distinguish between my descriptions of people’s political platform, and my inferences about what they “really want”. The thing I was trying to describe was more like “what is the stance of this group” than “do people in the group actually believe the stance”.
A more accurate read of what the “real motivations” are would have been something like “you prevent it by using decentralization, until you’re in a position where you can centralize power yourself, and then you try to centralize power yourself”.
(Though that’s probably a bit too cynical—I think there are still parts of Elon that have a principled belief in decentralization. My guess is just that they won’t win out over his power-seeking parts when push comes to shove.)
Hm, the fact that you replied to me makes it seem like you’re disagreeing with me? But I basically agree with everything you said in this comment. My disagreement was about the specific example that Isopropylpod gave.
Thanks for the comment! A few replies:
I don’t mean to imply that subagents are totally separate entities. At the very least they all can access many shared facts and experiences.
And I don’t think that reuse of subcomponents is mutually exclusive from the mechanisms I described. In fact, you could see my mechanisms as attempts to figure out which subcomponents are used for coordination. (E.g. if a bunch of subagents are voting/bargaining over which goal to pursue, probably the goal that they land on will be one that’s pretty comprehensible to most of them.)
Re shards: there are a bunch of similarities. But it seemed to me that shard theory was focused on pretty simple subagents. E.g. from the original post: “Human values are … sets of contextually activated heuristics”; and later “human values are implemented by contextually activated circuits which activate in situations downstream of past reinforcement so as to steer decision-making towards the objects of past reinforcement”.
Whereas I think of many human values as being constituted by subagents that are far too complex to be described in that way. In my view, many important subagents are sophisticated enough that basically any description you give of them would also have to be a description of a whole human (e.g. if you wouldn’t describe a human as a “contextually activated circuit”, then you shouldn’t describe subagents that way).
This may just be a vibes difference; many roads lead to Rome. But the research directions I’ve laid out above are very distinct from the ones that shard theory people are working on.
EDIT: more on shards here.
FWIW I think we’ve found one crucial angle on moral progress, but that this isn’t as surprising/coincidental as it may seem because there are several other angles on moral progress that are comparably important, including:
Political activism (e.g. free speech activism, various whistleblowers) that maintains societies in which moral progress can be made.
(The good parts of) neuroscience/psychology, which are making progress towards empirically-grounded theories of cognition, and thereby have taught (and will continue to teach) us a lot about moral cognition.
Various approaches to introspection + emotional health (including Buddhism, some therapy modalities, some psychiatry). These produce the internal clarity that is crucial for embodying + instantiating moral progress.
Some right-wing philosophers who I think are grappling with important aspects of moral progress that are too controversial for LessWrong (I don’t want to elaborate here because it’ll inevitably take over the thread, but am planning to write at more length about this soonish).
We disagree on which explanation is more straightforward, but regardless, that type of inference is very different from “literal written evidence”.
One of the main ways I think about empowerment is in terms of allowing better coordination between subagents.
In the case of an individual human, extreme morality can be seen as one subagent seizing control and overriding other subagents (like the ones who don’t want to chop off body parts).
In the case of a group, extreme morality can be seen in terms of preference cascades that go beyond what most (or even any) of the individuals involved with them would individually prefer.
In both cases, replacing fear-based motivation with less coercive/more cooperative interactions between subagents would go a long way towards reducing value drift.
Third-wave AI safety needs sociopolitical thinking
In response to an email about what a pro-human ideology for the future looks like, I wrote up the following:
The pro-human egregore I’m currently designing (which I call fractal empowerment) incorporates three key ideas:
Firstly, we can see virtue ethics as a way for less powerful agents to aggregate to form more powerful superagents that preserve the interests of those original less powerful agents. E.g. virtues like integrity, loyalty, etc help prevent divide-and-conquer strategies. This would have been in the interests of the rest of the world when Europe was trying to colonize them, and will be in the best interests of humans when AIs try to conquer us.
Secondly, the most robust way for a more powerful agent to be altruistic towards a less powerful agent is not for it to optimize for that agent’s welfare, but rather to optimize for its empowerment. This prevents predatory strategies from masquerading as altruism (e.g. agents claiming “I’ll conquer you and then I’ll empower you”, which then somehow never get around to the second step).
Thirdly: the generational contract. From any given starting point, there are a huge number of possible coalitions which could form, and in some sense it’s arbitrary which set of coalitions you choose. But one thing which is true for both humans and AIs is that each generation wants to be treated well by the next generation. And so the best intertemporal Schelling point is for coalitions to be inherently historical: that is, they balance the interests of old agents and new agents (even when the new agents could in theory form a coalition against all the old agents). From this perspective, path-dependence is a feature not a bug: there are many possible futures but only one history, meaning that this single history can be used to coordinate.
In some sense this is a core idea of UDT: when coordinating with forks of yourself, you defer to your unique last common ancestor. When it’s not literally a fork of yourself, there’s more arbitrariness but you can still often find a way to use history to narrow down on coordination Schelling points (e.g. “what would Jesus do”).
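As a minimal sketch of what “defer to your unique last common ancestor” looks like mechanically (the agent names here are hypothetical, and real histories are of course messier than a clean fork tree):

```python
# Minimal sketch: treat agents as nodes in a fork tree and walk up parent
# pointers to find the most recent shared ancestor, which then serves as the
# shared reference point (Schelling point) for coordination.

def ancestors(agent, parent):
    """Return the chain [agent, parent, grandparent, ...] up to the root."""
    chain = []
    while agent is not None:
        chain.append(agent)
        agent = parent.get(agent)
    return chain

def last_common_ancestor(a, b, parent):
    """Most recent node appearing in both agents' ancestries, or None."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    return None

# Hypothetical fork history: two successors descended from the same original agent.
parent = {"fork_A2": "fork_A1", "fork_A1": "original",
          "fork_B1": "original", "original": None}
print(last_common_ancestor("fork_A2", "fork_B1", parent))  # -> "original"
```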
And so, bringing these together, we get a notion of fractal empowerment: more capable agents empower less capable agents (in particular their ancestors) by helping them cultivate (coordination-theoretic) virtues. The ancestors then form the “core” of a society growing outwards towards increasingly advanced capabilities. The role of unaugmented humans would in some sense be similar to the role of “inner children” within healthy human psychology: young and dumb but still an entity which the rest of the organism cares for and empowers.
In my post on value systematization I used utilitarianism as a central example.
Value systematization is important because it’s a process by which a small number of goals end up shaping a huge amount of behavior. But there’s another different way in which this happens: core emotional motivations formed during childhood (e.g. fear of death) often drive a huge amount of our behavior, in ways that are hard for us to notice.
Fear of death and utilitarianism are very different. The former is very visceral and deep-rooted; it typically influences our behavior via subtle channels that we don’t even consciously notice (because we suppress a lot of our fears). The latter is very abstract and cerebral, and it typically influences our behavior via allowing us to explicitly reason about which strategies to adopt.
But fear of death does seem like a kind of value systematization. Before we have a concept of death we experience a bunch of stuff which is scary for reasons we don’t understand. Then we learn about death, and then it seems like we systematize a lot of that scariness into “it’s bad because you might die”.
But it seems like this is happening way less consciously than systematization to become a utilitarian. So maybe we need to think about systematization happening separately in system 1 and system 2? Or maybe we should think about it as systematization happening repeatedly in “layers” over time, where earlier layers persist but are harder to access later on.
I feel pretty confused about this. But for now my mental model of the mind is as two (partially overlapping) inverted pyramids, one bottoming out in a handful of visceral motivations like “fear of death” and “avoid pain” and “find love”, and the other bottoming out in a handful of philosophical motivations like “be a good Christian” or “save the planet” or “make America great again” or “maximize utility”. The second (system 2) pyramid is trying to systematize the parts of system 1 that it can observe, but it can’t actually observe the deepest parts (or, when it does, it tries to oppose them), which creates conflict between the two systems.
I’ve now edited that section. Old version and new version here for posterity.
Old version:
None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely.
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments).
New version:
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
I was a bit lazy in how I phrased this. I agree with all your points; the thing I’m trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion “Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for”:
Strongly incentivizes deception (including self-deception) during bargaining (e.g. each agent wants to understate the difficulty of providing cake for it, so as to be the one the cake ends up going to).
Strongly incentivizes defection from the deal once one of the agents realizes that it’ll get no cake going forward.
Is non-robust to multi-agent dynamics (e.g. what if one of Alice’s allies later decides “actually I’m going to sell cakes to the Alice+Bob coalition more cheaply if Alice gets to eat them”? Does that then divert Bob’s resources towards buying cakes for Alice?)
EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.
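As a toy illustration of how brittle this particular aggregate is (the numbers are hypothetical, and the allocation rule is just the “all cake to whoever is cheaper to provide for” proposal above):

```python
# Toy sketch of the proposed "indifferent EUM": all the cake goes to whichever
# agent is cheaper to provide cake for. Costs below are made-up numbers.

def allocate(cost_alice: float, cost_bob: float) -> tuple[float, float]:
    """Return (Alice's share, Bob's share) under the indifferent aggregate."""
    return (1.0, 0.0) if cost_alice < cost_bob else (0.0, 1.0)

print(allocate(1.0, 1.1))   # (1.0, 0.0): Alice gets everything
print(allocate(1.0, 0.95))  # (0.0, 1.0): a small shift in costs (e.g. an ally
                            # subsidizing one side) flips the entire allocation
```

Because the rule is all-or-nothing in the reported costs, the whole outcome hinges on exactly the things that are easiest to game or perturb, which is where the deception, defection, and third-party problems above come from.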
EDIT: another thing I glossed over is that IIUC Harsanyi’s theorem says the aggregate of EUMs should maximize a weighted average of their utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn’t technically kosher. This may seem nitpicky but I think it’s yet another illustration of the underlying non-robustness of EUM.
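To spell out the distinction (this is just standard Harsanyi/vNM bookkeeping, in my own notation): Harsanyi says the aggregate’s utility function must be a fixed weighted sum of the members’ utilities,

$$U_{\text{agg}}(x) = w_A\, U_A(x) + w_B\, U_B(x), \qquad w_A, w_B \ge 0 \text{ fixed,}$$

whereas tossing a coin between outcome $X$ (option #1 above) and outcome $Y$ (option #2) is a lottery, and a vNM agent evaluates lotteries linearly:

$$U_{\text{agg}}\big(p\,X + (1-p)\,Y\big) = p\,U_{\text{agg}}(X) + (1-p)\,U_{\text{agg}}(Y) \le \max\big(U_{\text{agg}}(X),\, U_{\text{agg}}(Y)\big).$$

So the aggregate can never strictly prefer the 50/50 mixture to both pure options, which is exactly the preference a fairness-valuing Alice+Bob would want it to have.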
Towards a scale-free theory of intelligent agency
On a meta level, I have a narrative that goes something like: LessWrong tried to be truth-seeking, but was scared of discussing the culture war, so blocked that off. But then the culture war ate the world, and various harms have come about from not having thought clearly about that (e.g. AI governance being a default left-wing enterprise that tried to make common cause with AI ethics). Now cancel culture is over and there are very few political risks to thinking about culture wars, but people are still scared to. (You can see Scott gradually dipping his toe into the race + IQ stuff over the past few months, but in a pretty frightened way. E.g. at one point he stated what I think is basically his position, then appended something along the lines of “And I’m literally Hitler and should be shunned.”)
Thanks for the well-written and good-faith reply. I feel a bit confused by how to relate to it on a meta level, so let me think out loud for a while.
I’m not surprised that I’m reinventing a bunch of ideas from the humanities, given that I don’t have much of a humanities background and didn’t dig very far through the literature.
But I have some sense that even if I had dug for these humanities concepts, they wouldn’t give me what I want.
What do I want?
Concepts that are applied to explaining current cultural and political phenomena around me (because those are the ones I’m most aware of and interested in). It seems like the humanities are currently incapable of analyzing their own behavior using (their versions of) these ideas, because of their level of ideological conformity. But maybe it’s there and I just don’t know about it?
Concepts that are informed by game theory and other formal models (as the work I covered in my three-book review was). I get the sense that the most natural thinkers from the humanities to read on these topics (Foucault? Habermas?) don’t do this.
Concepts that slot naturally into my understanding of how intelligence works, letting me link my thinking about sociology to my thinking about AI. This is more subjective, but e.g. the distinction between centralized and distributed agents has been very useful for me. This part is more about me writing for myself rather than other people.
So I’d be interested in pointers to sources that can give me #1 and #2 in particular.
EDIT: actually I think there’s another meta-level gap between us. Something like: you characterize Yarvin as just being annoyed that the consensus disagrees with him. But in the 15 years since he was originally writing, the consensus did kinda go insane. So it’s a bit odd to not give him at least some credit for getting something important right in advance.
EDIT: upon reflection the first thing I should do is probably to ask you for a bunch of the best examples of the thing you’re talking about throughout history. I.e. insofar as the world is better than it could be (or worse than it could be) at what points did careful philosophical reasoning (or the lack of it) make the biggest difference?
Original comment:
The term “careful thinking” here seems to be doing a lot of work, and I’m worried that there’s a kind of motte and bailey going on. In your earlier comment you describe it as “analytical philosophy, or more broadly careful/skeptical philosophy”. But I think we agree that most academic analytic philosophy is bad, and often worse than laypeople’s intuitive priors (in part due to strong selection effects on who enters the field—most philosophers of religion believe in god, most philosophers of aesthetics believe in the objectivity of aesthetics, etc).
So then we can fall back on LessWrong as an example of careful thinking. But as we discussed above, even the leading figure on LessWrong was insufficiently careful even about the main focus of his work for it to be robustly valuable.
So I basically get the sense that the role of careful thinking in your worldview is something like “the thing that I, Wei Dai, ascribe my success to”. And I do agree that you’ve been very successful in a bunch of intellectual endeavours. But I expect that your “secret sauce” is a confluence of a bunch of factors (including IQ, emotional temperament, background knowledge, etc) only one of which was “being in a community that prioritized careful thinking”. And then I also think you’re missing a bunch of other secret sauces that would make your impact on the world better (like more ability to export your ideas to other people).
In other words, the bailey seems to be “careful thinking is the thing we should prioritize in order to make the world better”, and the motte is “I, Wei Dai, seem to be doing something good, even if basically everyone else is falling into the valley of bad rationality”.
One reason I’m personally pushing back on this, btw, is that my own self-narrative for why I’m able to be intellectually productive relies in significant part on me being less intellectually careful than other people—so that I’m willing to throw out a bunch of ideas that are half-formed and non-rigorous, iterate, and eventually get to the better ones. Similarly, a lot of the value that the wider blogosphere has created comes from people being less careful than existing academic norms allow (including Eliezer and Scott Alexander, whose best works are often quite polemical).
In short: I totally think we want more people coming up with good ideas, and that this is a big bottleneck. But there are many different directions in which we should tug people in order to make them more intellectually productive. Many academics should be less careful. Many people on LessWrong should be more careful. Some scientists should be less empirical, others should be more empirical; some less mathematically rigorous, others more mathematically rigorous. Others should try to live in countries that are less repressive of new potentially-crazy ideas (hence politics being important). And then, of course, others should be figuring out how to actually get good ideas implemented.
Meanwhile, Eliezer and Sam and Elon should have had less of a burning desire to found an AGI lab. I agree that this can be described by “wanting to be the hero who saves the world”, but this seems to function as a curiosity stopper for you. When I talk about emotional health a lot of what I mean is finding ways to become less status-oriented (or, in your own words, “not being distracted/influenced by competing motivations”). I think of extremely strong motivations to change the world (as these outlier figures have) as typically driven by some kind of core emotional dysregulation. And specifically I think of fear-based motivation as the underlying phenomenon which implements status-seeking and many other behaviors which are harmful when taken too far. (This is not an attempt to replace evo-psych, btw—it’s an account of the implementation mechanisms that evolution used to get us to do the things it wanted, which now are sometimes maladapted to our current environment.) I write about a bunch of these models in my Replacing Fear sequence.