Richard_Ngo
Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
Worse than the current situation, because the counterfactual is that some later project happens which kicks off in a less race-y manner.
In other words, whatever the chance of its motivation shifting over time, it seems dominated by the chance that starting the equivalent project later would just have better motivations from the outset.
Great post. One slightly nitpicky point, though: even in the section where you argue that probabilities are cursed, you are still talking in the language of probabilities (e.g. “my modal guess is that I’m in a solipsist simulation that is a fork of a bigger simulation”).
I think there’s probably a deeper ontological shift you can make, to a mindset where there’s no actual ground truth about “where you are”. I think in order to do that you probably also need to go beyond “expected utilities are real”, because expected utilities need to be calculated by assigning credences to worlds and then multiplying them by the expected impact in each world.
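In symbols, the calculation I’m referring to is just the standard one (notation mine):

$$EU(a) \;=\; \sum_{w} \mathrm{Cr}(w) \cdot V(a, w)$$

where Cr(w) is the credence assigned to world w and V(a, w) is the expected impact of action a in that world; it’s the Cr(w) assignment that presupposes some ground truth about “where you are”.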
Instead the most “real” thing here I’d guess is something like “I am an agent in a superposition of being in many places in the multiverse. Each of my actions is a superposition of uncountable trillions of actions that will lead to nothing plus a few that will have lasting causal influence. The degree to which I care about one strand of causal influence over another is determined by the coalitional dynamics of my many subagents”.
FWIW I think this is roughly the perspective on the multiverse Yudkowsky lays out in Planecrash (especially in the bits near the end where Keltham and Carissa discuss anthropics). Except that the idea of the degrees of caring being determined by coalitional dynamics is more closely related to geometric rationality.
I also tweeted about something similar recently (inspired by your post).
Cool, ty for (characteristically) thoughtful engagement.
I am still intuitively skeptical about a bunch of your numbers but now it’s the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).
Ty for the link, but these both seem like clearly bad semantics (e.g. under either of them the second-best hypothesis under consideration might score arbitrarily badly).
Just changed the name to The Minority Coalition.
1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It’s a step towards applying outside view, though not fully “outside view”.)
2. Yepp, agreed, though I think the common-sense connotations of “if this became” or “this would have a big effect” are causal, especially in the context where we’re talking to the actors who are involved in making that change. (E.g. the non-causal interpretation of your claim feels somewhat analogous to if I said to you “I’ll be more optimistic about your health if you take these pills”, and so you take the pills, and then I say “well the pills do nothing but now I’m more optimistic, because you’re the sort of person who’s willing to listen to recommendations”. True, but it also undermines people’s willingness/incentive to listen to my claims about what would make the world better.)
3. Here are ten that affect AI risk as much, one way or the other:
The US government “waking up” a couple of years earlier or later (one operationalization: AISIs existing or not right now).
The literal biggest names in the field of AI becoming focused on AI risk.
The fact that Anthropic managed to become a leading lab (and, relatedly, the fact that Meta and other highly safety-skeptical players are still behind).
Trump winning the election.
Elon doing all his Elon stuff (like founding x.AI, getting involved with Trump, etc).
The importance of transparency about frontier capabilities (I think of this one as more of a logical update that I know you’ve made).
o1-style reasoning as the next big breakthrough.
Takeoff speeds (whatever updates you’ve made in the last three years).
China’s trajectory of AI capabilities (whatever updates you’ve made about that in the last 3 years).
China’s probability of invading Taiwan (whatever updates you’ve made about that in the last 3 years).
And then I think in 3 years we’ll be able to publish a similar list of stuff that mostly we just hadn’t predicted or thought about before now.
I expect you’ll dispute a few of these; happy to concede the ones that are specifically about your updates if you disagree (unless you agree that you will probably update a bunch on them in the next 3 years).
But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand them, and then the labs produce them, but kinda begrudgingly, and don’t really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally, e.g. due to too much bureaucracy). And that seems very much like the sort of change that’s comparable to or smaller than the things above.
I think I would be more sympathetic to your view if the claim were “if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit”. That would probably halve my P(doom); it’s just a very, very strong criterion.
We have discussed this dynamic before but just for the record:
I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at this level of rigor and detail, my p(doom) would cut in half.
This is IMO not the type of change that should be able to cut someone’s P(doom) in half. There are so many different factors that are of this size and importance or bigger (including many that people simply have not thought of yet) that, if this change could halve your P(doom), then your P(doom) should be oscillating wildly all the time.
I flag this as an example of prioritizing inside-view considerations too strongly in forecasts. I think this is the sort of problem that arises when you “take Bayesianism too seriously”, which is one of the reasons why I wrote my recent post on why I’m not a Bayesian (and also my earlier post on Knightian uncertainty).
For context: our previous discussions about this related to Daniel’s claim that appointing one specific person to one specific important job could change his P(doom) by double digit percentage points. I similarly think this is not the type of consideration that should be able to swing people’s P(doom) that much (except maybe changing the US or Chinese leaders, but we weren’t talking about those).
Lastly, since this is a somewhat critical comment, I should flag that I really appreciate and admire Daniel’s forecasting, have learned a lot from him, and think he’s generally a great guy. The epistemology disagreements just disproportionately bug me.
The former can be sufficient—e.g. there are good theoretical researchers who have never done empirical work themselves.
In hindsight I think “close conjunction” was too strong—it’s more about picking up the ontologies and key insights from empirical work, which can be possible without following it very closely.
I think there’s something importantly true about your comment, but let me start with the ways I disagree. Firstly, the more ways in which you’re power-seeking, the more defense mechanisms will apply to you. Conversely, if you’re credibly trying to do a pretty narrow and widely-accepted thing, then there will be less backlash. So Jane Street is power-seeking in the sense of trying to earn money, but they don’t have much of a cultural or political agenda, they’re not trying to mobilize a wider movement, and earning money is a very normal thing for companies to do; it makes them one of thousands of comparably-sized companies. (Though note that there is a lot of backlash against companies in general, which are perceived to have too much power. This leads a wide swathe of people, especially on the left, and especially in Europe, to want to greatly disempower companies because they don’t trust them.)
Meanwhile the Gates Foundation has a philanthropic agenda, but like most foundations tries to steer clear of wider political issues, and also IIRC tries to focus on pretty object-level and widely-agreed-to-be-good interventions. Even so, it’s widely distrusted and feared, and Gates has become a symbol of hated global elites, to the extent where there are all sorts of conspiracy theories about him. That’d be even worse if the foundation were more political.
Lastly, it seems a bit facile to say that everyone hates Goldman due to “perceived greed rather than power-seeking per se”. A key problem is that people think of the greed as manifesting through political capture, evading regulatory oversight, deception, etc. That’s part of why it’s harder to tar entrepreneurs as greedy: it’s just much clearer that their wealth was made in legitimate ways.
Now the sense in which I agree: I think that “gaining power triggers defense mechanisms” is a good first pass, but we definitely want a more mechanistic explanation of what the defense mechanisms are, what triggers them, etc., in particular so we don’t just end up throwing our hands in the air and concluding that doing anything is hopeless and scary. And I also agree that your list is a good start. So maybe I’d just want to add to it stuff like:
having a broad-ranging political agenda (that isn’t near-universally agreed to be good)
having non-transparent interactions with many other powerful actors
having open-ended scope to expand
And maybe a few others (open to more suggestions).
The bits are not very meaningful in isolation; the claim “program-bit number 37 is a 1” has almost no meaning in the absence of further information about the other program bits. However, this isn’t much of an issue for the formalism.
In my post I defend the use of propositions as a way to understand models, and attack the use of propositions as a way to understand reality. You can think of this as a two-level structure: claims about models can be crisp and precise enough that it makes sense to talk about them in propositional terms, but for complex bits of reality you mostly want to make claims of the form “this is well-modeled by model X”. Those types of claims need to be understood in terms of continuous truth-values: they’re basically never entirely true or entirely false.
Separately, Solomonoff programs are non-central examples of models because they do not come with structural correspondences to reality attached (except via their inputs and outputs). Most models have some mapping that allows you to point at program-bits and infer some features of reality from them.
I notice as I write this that there’s some tension in my position: I’m saying we shouldn’t apply propositions to reality, but also the mappings I mentioned above allow us to formulate propositions like “the value of X in reality is approximately the value of this variable in my model”.
So maybe I’m actually arguing for a middle ground between two extremes:
The basic units of epistemology should all map precisely to claims about reality, and should be arbitrarily combinable and composable (the propositional view)
The basic units of epistemology should only map to claims about reality in terms of observable predictions, and not be combinable or composable at all (the Solomonoff view)
This spectrum isn’t fully well-defined even in my head but seems like an interesting way to view things which I’ll think more about.
The minority faction is the group of entities that are currently alive, as opposed to the vast number of entities that will exist in the future. I.e. the one Clarke talks about when he says “why won’t you help the rest of us form a coalition against them?”
In hindsight I should probably have called it The Minority Coalition.
Here’s how that would be handled by a Bayesian mind:
There’s some latent variable representing the semantics of “humanity will be extinct in 100 years”; call that variable S for semantics.
Lots of things can provide evidence about S. The sentence itself, context of the conversation, whatever my friend says about their intent, etc, etc.
… and yet it is totally allowed, by the math of Bayesian agents, for that variable S to still have some uncertainty in it even after conditioning on the sentence itself and the entire low-level physical state of my friend, or even the entire low-level physical state of the world.
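(A minimal toy version of that setup, with entirely made-up numbers and names, just to make the structure concrete: conditioning on everything observable can still leave a non-degenerate posterior over S.)

```python
# Toy model: S is the latent "semantics" variable; E is everything observable.
# All numbers are hypothetical, purely to illustrate the structure.
joint = {
    ("strict_extinction",    "observed_utterance"): 0.3,
    ("no_biological_humans", "observed_utterance"): 0.2,
    ("strict_extinction",    "other_utterance"):    0.1,
    ("no_biological_humans", "other_utterance"):    0.4,
}

evidence = "observed_utterance"

# Condition on E = evidence: P(S | E) is proportional to P(S, E).
unnormalised = {s: p for (s, e), p in joint.items() if e == evidence}
total = sum(unnormalised.values())
posterior = {s: p / total for s, p in unnormalised.items()}

print(posterior)  # {'strict_extinction': 0.6, 'no_biological_humans': 0.4} -- uncertainty remains
```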
What would resolve the uncertainty that remains after you have conditioned on the entire low-level state of the physical world? (I assume that we’re in the logically omniscient setting here?)
“Dragons are attacking Paris!” seems true by your reasoning, since there are no dragons, and therefore it is vacuously true that all of them are attacking Paris.
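(The underlying logical point, as a minimal sketch; the empty list and the predicate are of course made up:)

```python
# Under the standard logical reading, a universally quantified claim over an
# empty domain is vacuously true, regardless of what the predicate says.
dragons = []  # there are no dragons

def attacking_paris(dragon):
    return False  # never holds -- irrelevant, since the domain is empty

print(all(attacking_paris(d) for d in dragons))  # True: "all dragons are attacking Paris"
```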
Ty for the comment. I mostly disagree with it. Here’s my attempt to restate the thrust of your argument:
The issues with binary truth-values raised in the post are all basically getting at the idea that the meaning of a proposition is context-dependent. But we can model context-dependence in a Bayesian way by referring to latent variables in the speaker’s model of the world. Therefore we don’t need fuzzy truth-values.
But this assumes that, given the speaker’s probabilistic model, truth-values are binary. I don’t see why this needs to be the case. Here’s an example: suppose my non-transhumanist friend says “humanity will be extinct in 100 years”. And I say “by ‘extinct’ do you include humans being genetically engineered until they’re a different species? How about being uploaded? How about all being cryonically frozen, to be revived later? How about....”
In this case, there is simply no fact of the matter about which of these possibilities should be included or excluded in the context of my friend’s original claim, because (I’ll assume) they hadn’t considered any of those possibilities.
More prosaically, even if I have considered some possibilities in the past, at the time when I make a statement I’m not actively considering almost any of them. For some of them, if you’d raised those possibilities to me at the time, I’d have said “obviously I did/didn’t mean to include that”, but for others I’d have said “huh, idk”, and for others still I would have said different things depending on how you presented them to me. So what reason do we have to think that there’s any ground truth about what the context does or doesn’t include? Similar arguments apply re approximation error about how far away the grocery store is: clearly 10km error is unacceptable, and 1m is acceptable, but what reason do we have to think that any “correct” threshold can be deduced even given every fact about my brain-state when I asked the question?
I picture you saying in response to this “even if there are some problems with binary truth-values, fuzzy truth-values don’t actually help very much”. To this I say: yes, in the context of propositions, I agree. But that’s because we shouldn’t be doing epistemology in terms of propositions. And so you can think of the logical flow of my argument as:
Here’s why, even for propositions, binary truth is a mess. I’m not saying I can solve it but this section should at least leave you open-minded about fuzzy truth-values.
Here’s why we shouldn’t be thinking in terms of propositions at all, but rather in terms of models.
And when it comes to models, something like fuzzy truth-values seems very important (because it is crucial to be able to talk about models being closer to the truth without being absolutely true or false).
I accept that this logical flow wasn’t as clear as it could have been. Perhaps I should have started off by talking about models, and only then introduced fuzzy truth-values? But I needed the concept of fuzzy truth-values to explain why models are actually different from propositions at all, so idk.
I also accept that “something like fuzzy truth-values” is kinda undefined here, and am mostly punting that to a successor post.
Suppose you have two models of the earth; one is a sphere, one is an ellipsoid. Both are wrong, but they’re wrong in different ways. Now, we can operationalize a bunch of different implications of these hypotheses, but most of the time in science the main point of operationalizing the implications is not to choose between two existing models, or because we care directly about the operationalizations, but rather to come up with a new model that combines their benefits.
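For instance, one such operationalization, as a minimal sketch (using the usual mean spherical radius and the WGS84 equatorial radius as the two models’ inputs):

```python
import math

# Operationalize each model of the Earth as a prediction of equatorial circumference.
sphere_radius_km = 6371.0                   # mean radius used by the spherical model
ellipsoid_equatorial_radius_km = 6378.137   # WGS84 ellipsoid's equatorial radius

sphere_prediction = 2 * math.pi * sphere_radius_km
ellipsoid_prediction = 2 * math.pi * ellipsoid_equatorial_radius_km

print(f"sphere:    {sphere_prediction:,.0f} km")     # ~40,030 km
print(f"ellipsoid: {ellipsoid_prediction:,.0f} km")  # ~40,075 km
```

The ~45 km discrepancy is less interesting as a way to pick a winner than as a pointer to what a better successor model needs to capture.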
Why I’m not a Bayesian
IMO all of the “smooth/sharp” and “soft/hard” stuff is too abstract. When I concretely picture what the differences between them are, the aspect that stands out most is whether the takeoff will be concentrated within a single AI/project/company/country or distributed across many AIs/projects/companies/countries.
This is of course closely related to debates about slow/fast takeoff (as well as to the original Hanson/Yudkowsky debates). But using this distinction instead of any version of the slow/fast distinction has a few benefits:
If someone asks “why should I care about slow/fast takeoff?” a lot of the answers will end up appealing to the concentrated/distributed power thing. E.g. you might say “if takeoff is fast that means that there will be a few key points of leverage”.
Because it’s more concrete, I think it will provoke better debates (e.g. how would a single AI lab concretely end up outcompeting everyone else?).
This framing naturally concentrates the mind on an aspect of risk (concentration of power) that is concerning from both a misuse and a misalignment perspective.
Well, the whole point of national parks is that they’re always going to be unproductive because you can’t do stuff in them.
If you mean in terms of extracting raw resources, maybe (though presumably a bunch of mining/logging etc. in national parks could be pretty valuable), but either way it doesn’t matter, because the vast majority of the economic productivity you could get from them (e.g. by building cities) is banned.
Nothing makes humans all that special
This is just false. Humans are at the very least privileged in our role as biological bootloaders of AI. The emergence of written culture, industrial technology, and so on, is incredibly special from a historical perspective.
You only set aside occasional low-value fragments for national parks, mostly for your own pleasure and convenience, when it didn’t cost too much?
Earth as a proportion of the solar system’s planetary mass is probably comparable to national parks as a proportion of the Earth’s land, if not lower.
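Rough numbers for the planetary-mass side of that comparison (a back-of-envelope sketch using standard approximate masses):

```python
# Approximate planetary masses in kg (rounded reference values).
masses_kg = {
    "Mercury": 3.30e23,
    "Venus":   4.87e24,
    "Earth":   5.97e24,
    "Mars":    6.42e23,
    "Jupiter": 1.90e27,
    "Saturn":  5.68e26,
    "Uranus":  8.68e25,
    "Neptune": 1.02e26,
}

earth_fraction = masses_kg["Earth"] / sum(masses_kg.values())
print(f"{earth_fraction:.2%}")  # roughly 0.2% of the solar system's planetary mass
```

If national parks cover on the order of a few percent of Earth’s land, then Earth’s share of planetary mass is indeed the smaller of the two.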
I don’t think this line of argument is a good one. If there’s a 5% chance of x-risk and, say, a 50% chance that AGI makes the world just generally be very chaotic and high-stakes over the next few decades, then it seems very plausible that you should mostly be optimizing for making the 50% go well rather than the 5%.