Uncertainty about the future does not imply that AGI will go well
Subtitle: A partial defense of high-confidence AGI doom predictions.
Introduction
Consider these two kinds of accident scenarios:
In a default-success scenario, accidents are rare. For example, modern aviation is very safe thanks to decades of engineering efforts and a safety culture (e.g. the widespread use of checklists). When something goes wrong, it is often due to multiple independent failures that combine to cause a disaster (e.g. bad weather + communication failures + pilot not following checklist correctly).
In a default-failure scenario, accidents are the norm. For example, when I write a program to do something I haven’t done many times already, it usually fails the first time I try it. It then goes on to fail the second time and the third time as well. Here, failure on the first try is overdetermined―even if I fix the first bug, the second bug is still, independently, enough to cause the program to crash. This is typical in software engineering, and it can take many iterations and tests to move into the default-success regime.
See also: conjunctive vs. disjunctive risk scenarios.
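To make the conjunctive/disjunctive distinction concrete, here is a minimal sketch with made-up probabilities (the numbers are purely illustrative): in a default-success scenario a disaster requires several independent failures to line up, while in a default-failure scenario any one of several independent problems is enough to sink the attempt.

```python
# Toy illustration of conjunctive vs. disjunctive risk (all numbers made up).

# Default-success (conjunctive): disaster requires ALL of several
# independent failures to happen at once, so P(disaster) is tiny.
component_failure_probs = [0.01, 0.02, 0.05]  # hypothetical per-component failure rates
p_disaster = 1.0
for p in component_failure_probs:
    p_disaster *= p
print(f"conjunctive P(disaster) = {p_disaster:.6f}")  # 0.000010

# Default-failure (disjunctive): ANY one of several independent problems
# is enough to cause failure, so success requires getting everything right.
per_problem_failure_probs = [0.3, 0.3, 0.3, 0.3]  # hypothetical
p_success = 1.0
for p in per_problem_failure_probs:
    p_success *= (1 - p)
print(f"disjunctive P(failure) = {1 - p_success:.3f}")  # ~0.760
```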
Default-success scenarios include most engineering tasks that we have lots of experience with and know how to do well: building bridges, building skyscrapers, etc. Default-failure scenarios, as far as I can tell, come in two kinds: scenarios in which we’re trying to do something for the first time (rocket test launches, prototypes, new technologies) and scenarios in which there is a competent adversary that is trying to break the system, as in computer security.[1]
Predictions on AGI risk
In the following, I use P(doom) to refer to the probability of an AGI takeover and/or human extinction due to the development of AGI.
I often encounter the following argument against predictions of AGI catastrophes:
Alice: We seem to be on track to build an AGI smarter than humans. We don’t know how to solve the technical problem of building an AGI we can control, or the political problem of convincing people to not build AGI. Every plausible scenario I’ve ever thought or heard of leads to AGI takeover. In my estimate, P(doom) is [high number].
Bob: I disagree. It’s overconfident to estimate high P(doom). Humans are usually bad at predicting the future, especially when it comes to novel technologies like AGI. When you account for how uncertain your predictions are, your estimate should be at most [low number].
I’m being vague about the numbers because I’ve seen Bob’s argument made in many different situations. In one recent conversation I witnessed, the Bob-Alice split was P(doom) 0.5% vs. ~10%, and in another discussion it was 10% vs. 90%.
My main claim is that Alice and Bob don’t actually disagree about how uncertain or hard to predict the future is―instead, they disagree about the degree to which AGI risk is default-success vs. default-failure. If AGI risk is (mostly) default-failure, then uncertainty is a reason for pessimism rather than optimism, and Alice is right to predict failure.
In this sense I think Bob is missing the point. Bob claims that Alice is not sufficiently uncertain about her AI predictions, or has not integrated her uncertainty into her estimate well enough. This is not necessarily true; it may just be that Alice’s uncertainty about her reasoning doesn’t make her much more optimistic.
Instead of trying to refute Alice from general principles, I think Bob should point to concrete reasons for optimism (for example, Bob could say “for reasons A, B, and C it is likely that we can coordinate on not building AGI for the next 40 years and solve alignment in the meantime”).
Uncertainty does not (necessarily) mean you should be more optimistic
Many people are skeptical of the ‘default-failure’ frame, so I’ll give a bit more color here by listing some reasons why I think Bob’s argument is wrong/unproductive. I won’t go into detail about why AGI risk specifically might be a default-failure scenario; you can find a summary of those arguments in Nate Soares’ post on why AGI ruin is likely.
It’s true that the future is often hard to predict; for example, experts often fail to predict technological developments. This is not a reason for optimism. It would be kind of weird if it was! Humans are generally bad at predicting the future, especially for technological progress, and this is bad news for AI safety.
In particular: if all the AI researchers are uncertain about what will happen, that is a bad sign in much the same way that it would be a bad sign if none of your security engineers understood the system they are supposed to secure.
Analogy: if I’m in charge of software security for a company, and my impression is that the system is almost certainly insecure, it is not a good argument to say “well you don’t completely understand the system, so you might be wrong!” ― I may be wrong, but being wrong does not bode well for our security.
To believe P(doom) is high, all you really need to be convinced of is that the default outcome for messing up superhuman AGI is human extinction, and that we’re not prepared. Our understanding here is incomplete but still relatively good compared to details that are harder to predict, e.g. when exactly AGI will arrive or what early forms of AGI will look like.
It is not always wrong to make high-confidence disaster predictions. For example, people saying “covid will be a disaster with high (~90%) probability” in February 2020 were predictably correct, even though covid was a very novel situation. There was a lot of uncertainty, and the people who predicted disaster usually got the details wrong like everyone else, but the overall picture was still correct because the details didn’t matter much.
A confidence of 90% is not actually much harder to achieve than 10%, relative to the baseline extinction risk for a new technology, which is close to 0%. An estimate of P(doom) = 30% already leans very heavily on your inside view of the risks involved; you don’t need to trust your reasoning all that much more to estimate 90% instead.
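One way to see this is in terms of log-odds: measured in bits of evidence away from a near-zero baseline, 90% is not much further out than 10%. A rough back-of-the-envelope sketch (the 0.1% baseline is my own illustrative assumption, not a number from this post):

```python
import math

def log_odds_bits(p: float) -> float:
    """Log-odds of p, in bits."""
    return math.log2(p / (1 - p))

baseline = 0.001  # assumed near-zero baseline extinction risk for a new technology
for p in (0.10, 0.30, 0.90):
    bits = log_odds_bits(p) - log_odds_bits(baseline)
    print(f"P(doom) = {p:.0%}: {bits:.1f} bits of evidence above a {baseline:.1%} baseline")

# Roughly: moving from 0.1% to 10% takes ~6.8 bits of evidence, while moving
# from 10% all the way to 90% takes only ~6.3 bits more -- most of the distance
# from the baseline is already covered by the time you estimate 10-30%.
```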
Put differently: there’s no reason in particular why Bob’s uncertainty argument should cap your confidence at ~80%, rather than 1% or 0.1%.
(It seems totally reasonable to me for a first reaction to AI X-risk to be “Eh I don’t know, it’s an interesting idea and I’ll think more on it, but it does seem pretty crazy; if I had to estimate P(doom) right now I would say ~0.1%, though I would prefer not to give a number at all.” Followed, to be clear, by rapid updates in favor of high P(doom), though not necessarily 90%; I think 90% makes sense for people who have slammed their heads against the difficulties involved, and noticed a pattern where the wall they’re slamming their heads against is pretty hard and doesn’t have visible weak spots; but otherwise you wouldn’t necessarily be that pessimistic.)
More generally: estimates around 90% aren’t all that “confident”. If you’re well-calibrated, changing your mind about something that you estimate to be 90% likely is something that happens all the time. So P(X) = 90% means “I expect X to happen, though I’m happy to change my mind and in fact regularly do change my mind about claims like this”.
It makes sense to be uncertain about your beliefs, and about whether you thought of all the relevant things (usually you didn’t). Rather than be generically uncertain about everything, it’s usually better to be uncertain about specific parts of your model.
For example: I’m uncertain about the behavior and capability profile of the first AI that surpasses humans in scientific research. This makes me more pessimistic about alignment relative to a baseline where I was certain, because any strategy that depends on specific assumptions about the capabilities of this AI is unlikely to work.
For a second example: I think there probably won’t be any international ban or regulation on large training runs that lengthens timelines by >10 years, but I’m pretty uncertain. This makes me more optimistic relative to a baseline where I was certain governments would do nothing.
Put differently: most of your uncertainty about beliefs should be part of your model, not some external thing that magically pushes all your beliefs towards 50% or 0% or 100%.
Some things I’m not saying
This part is me hedging my claims. Feel free to skip if that seems like a boring thing to read.
I don’t personally estimate P(doom) above 90%.
I’m also not saying there are no reasons to be optimistic. I’m claiming that reasons for optimism should usually be concrete arguments about possible ways to avoid doom. For example, Paul Christiano argues for a somewhat lower than 90% P(doom) here, and I think the general shape of his argument makes sense, in contrast to Bob’s above.
I do think there is a correct version of the argument that, if your model says P(outcome) = 0.99, model uncertainty will generally be a reason to update downwards. I think people already take that into account when stating high P(doom) estimates. Here’s a sketch of a plausible line of reasoning (summarized and not my numbers, but my own reasoning is similar, and I don’t think the numbers are crazy):
Almost every time I imagine a concrete scenario for how AGI development might go, it leads to an outcome where humans go extinct.
I can imagine some ways in which things go well, but they seem pretty fanciful; for example, a sudden international treaty that forbids large training runs and successfully enforces this. (I do expect there’ll be other government efforts, but I don’t expect those to change things much for the better.) So my “within-model” prediction is P(doom) = 0.99.
My model is almost certainly wrong. Sadly, for most scenarios I can imagine, being wrong would only make things worse. I’m literally a safety researcher; me being totally wrong about e.g. what the first AGI looks like is not a good sign for safety (and I don’t expect other safety researchers to have better models). Almost all surprises are bad.
That said: while technical surprises are probably bad, there are other kinds of positive surprises we could get, for example: more progress on AI safety than expected, better interpretability methods, more uptake of AI risk concerns by the broader ML community, more government action on regulating AI.
In fact, there are some kinds of cumulative surprises that could add up to save us. For example: enough regulation of AI could lead to ~10-year longer timelines; more progress than expected in interpretability could lead to more compelling demonstrations of misalignment; more uptake of AI risk by the broader scientific community might lead to more safety progress and an overall more careful approach to AGI.
Note that this is not an update made from pure uncertainty―there is a concrete story here about how exactly surprises might actually be helpful, rather than harmful. It’s not a particularly strong story, though; it needs many things to go better than expected.
Now, that particular story is not likely at all. But it seems like there are many stories in that general category, such that the total likelihood of a good surprise adds up to something like 10%.
Note that the basic expectation of ‘surprises are often bad’ still applies. Not knowing how governments or society will react to AI is hardly helpful for the people who are currently trying to get governments or society to react in a useful way.
So my overall, all-things-considered P(doom) is 90%, mostly because of a somewhat sketchy downwards update from model uncertainty, without which the estimate would be around 99%.
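The arithmetic behind that combination is simple. Here is a minimal sketch using the sketch’s own numbers, plus my own simplifying assumption that a “saving” surprise fully prevents doom:

```python
# Arithmetic behind the sketch above (the probabilities are the sketch's, not mine).
p_doom_within_model = 0.99        # "almost every concrete scenario ends in extinction"
p_good_surprise = 0.10            # total chance that some category of good surprises saves us
p_doom_given_good_surprise = 0.0  # simplifying assumption: a "saving" surprise fully saves us

p_doom = (1 - p_good_surprise) * p_doom_within_model + p_good_surprise * p_doom_given_good_surprise
print(f"all-things-considered P(doom) ~= {p_doom:.2f}")  # 0.89, i.e. roughly 90%
```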
It’s debatable how large the downwards update here should be―it could reasonably be more or less than 10%, and it’s plausible that we’re in the kind of domain where small quantified probability updates aren’t very useful at all.
I don’t mean to say that the reasoning here is the only reasonable version out there. It depends a lot on how likely you think various definitely-useful surprises are, like long timelines to AGI and slow progress after proto-AGI. But I do think it is wrong to call high P(doom) estimates overconfident without further, more detailed criticism.
Finally, I haven’t given an explicit argument for AGI risk; there’s a lot of that elsewhere.
[1] Note how AGI somehow manages to satisfy both of these criteria at once.