It’s been over a year since the original post and 7 months since the openphil revision.
A top-level summary:
My estimates for timelines are pretty much the same as they were.
My P(doom) has gone down overall (to about 30%), and the nature of the doom has shifted (misuse, broadly construed, dominates).
And while I don’t think this is the most surprising outcome or the most critical detail, it’s probably worth pointing out some context. From NVIDIA:
In two quarters, from Q1 FY24 to Q3 FY24, datacenter revenues went from $4.28B to $14.51B.
From the post:
In 3 years, if NVIDIA’s production increases another 5x …
Revenue isn’t a perfect proxy for shipped compute, but I think it’s safe to say we’ve entered a period of extreme interest in compute acquisition. “5x” in 3 years seems conservative.[1] I doubt the B100 is going to slow this curve down, and competitors aren’t idle: AMD’s MI300X is within striking distance, and even Intel’s Gaudi 2 has promising results.
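As a rough sanity check, here’s a back-of-the-envelope calculation (mine, using only the revenue figures quoted above, and treating revenue as a crude proxy for shipped compute):

```python
# Back-of-the-envelope: compare NVIDIA's observed datacenter revenue growth
# against the growth rate that "5x in 3 years" would imply.
q1_fy24 = 4.28e9    # datacenter revenue, Q1 FY24 (USD)
q3_fy24 = 14.51e9   # datacenter revenue, Q3 FY24 (USD)

observed_quarterly = (q3_fy24 / q1_fy24) ** 0.5   # ~1.84x per quarter observed
implied_quarterly = 5 ** (1 / 12)                 # "5x over 12 quarters" ~= 1.14x per quarter

print(f"observed: {observed_quarterly:.2f}x per quarter")
print(f"needed for 5x/3yr: {implied_quarterly:.2f}x per quarter")
```

Growth obviously can’t keep compounding at anything like the observed rate, but the gap between roughly 84% per quarter observed and roughly 14% per quarter required is why “conservative” seems fair.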
Chip manufacturing remains a bottleneck, but it’s a bottleneck that’s widening as fast as it can to catch up with absurd demand. Manufacturing may still be the bottleneck in 5 years, but it will bind at a much higher level of production.
On the difficulty of intelligence
I’m torn about the “too much intelligence within bounds” stuff. On one hand, I think it points towards the most important batch of insights in the post, but on the other hand, it ends with an unsatisfying “there’s more important stuff here! I can’t talk about it but trust me bro!”
I’m not sure what to do about this. The best arguments and evidence are things that fall into the bucket of “probably don’t talk about this in public out of an abundance of caution.” It’s not one weird trick to explode the world, but it’s not completely benign either.
Continued research and private conversations haven’t made me less concerned. I do know there are some other people who are worried about similar things, but it’s unclear how widely understood it is, or whether someone has a strong argument against it that I don’t know about.
So, while unsatisfying, I’d still assert that there are highly accessible paths to broadly superhuman capability on short timescales. Little of my forecast’s variance arises from uncertainty on this point; it’s mostly a question of when certain things are invented, adopted, and then deployed at sufficient scale. Sequential human effort is a big chunk of that remaining time; there are video games that took less time to build than the gap between this post’s original publication date and its median estimate of 2030.
On doom
When originally writing this, my model of how capabilities would develop was far less defined, and my doom-model was necessarily more generic.
A brief summary would be:
We have a means of reaching extreme levels of capability in systems that don’t necessarily exhibit preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn’t route through the world.
It’s remarkably easy to get this form of extreme capability to guide itself. This isn’t some incidental detail; it arises from the core process that the model learned to implement.
That core process is learned reliably because the training process that yielded it leaves no room for anything else. It’s not a sparse, distant reward target; it is a profoundly constraining and informative one.
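To make “profoundly constraining” concrete, the standard autoregressive pretraining objective (the usual textbook formulation, not a quote from the post) penalizes the model on every token of every sequence:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]$$

Here $\theta$ are the model’s parameters and $\mathcal{D}$ is the training distribution. Every position $t$ supplies dense supervision about what the learned process must compute, rather than a sparse reward handed out at the end of a long trajectory.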
I’ve written more on the nice properties of some of these architectures elsewhere. I’m in the process of writing up a complementary post on why I think these properties (and using them properly) are an attractor in capabilities, and further, why I think some of the x-riskiest forms of optimization process are actively repulsive for capabilities. This requires some justification, but alas, the post will have to wait some number of weeks in the queue behind a research project.
The source of the doom-update is the correction of some hidden assumptions in my doom model. My original model was downstream of agent foundations-y models, but naive. It followed a process: set up a framework, make internally coherent arguments within that framework, observe highly concerning results, then neglect to notice where the framework didn’t apply.
Specifically, some of the arguments feeding into my doom model were covertly replacing instances of optimizers with hypercomputer-based optimizers[2], because hey, once you’ve got an optimizer and you don’t know any bounds on it, you probably shouldn’t assume it’ll just turn out convenient for you, and hypercomputer-optimizers are the least convenient.
For example, this part:
Is that enough to start deeply modeling internal agents and other phenomena concerning for safety?
And this part:
AGI probably isn’t going to suffer from these issues as much. Building an oracle is probably still worth it to a company even if it takes 10 seconds for it to respond, and it’s still worth it if you have to double check its answers (up until oops dead, anyway).
With no justification, I imported deceptive mesaoptimizers and other “unbound” threats. Under the earlier model, this seemed natural.
I now think there are bounds on pretty much all relevant optimizing processes up and down the stack, from the structure of learned mesaoptimizers to the whole capability-seeking industry. Those bounds necessarily chop off large chunks of optimizer-derived doom; many outcomes that previously seemed convergent to me now seem extremely hard to access.
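To illustrate the distinction with a toy (entirely mine, not from the post): an unbounded brute-force search happily lands on a degenerate global optimum of a mis-specified objective, while a bounded local search with a limited budget and step size never gets anywhere near it.

```python
# Toy objective: a gentle intended optimum near x = 0, plus an absurdly
# rewarded "exploit" spike the designer didn't intend, hidden at x = 42.
import numpy as np

rng = np.random.default_rng(0)

def proxy_objective(x):
    intended = -x ** 2                                  # what the designer meant
    exploit = 1e6 * np.exp(-((x - 42.0) ** 2) / 1e-4)   # what they didn't
    return intended + exploit

# Unbounded ("hypercomputer-ish") optimizer: exhaustive search over the domain.
grid = np.linspace(-100.0, 100.0, 2_000_001)
brute_force = grid[np.argmax(proxy_objective(grid))]

# Bounded optimizer: a short local hill-climb from a sensible starting point.
x = 1.0
for _ in range(200):                       # limited compute budget
    candidate = x + rng.normal(scale=0.1)  # limited step size
    if proxy_objective(candidate) > proxy_objective(x):
        x = candidate

print(f"unbounded search finds x = {brute_force:.2f}")  # 42.00: the exploit
print(f"bounded search ends near x = {x:.2f}")          # near 0: the intended hill
```

Real learned optimizers are nothing like either extreme; the point is just that the arguments above were quietly assuming something much closer to the first loop than the second.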
As a result, “technical safety failure causes existential catastrophe” dropped in probability by around 75-90%, down to something like 5%-ish.[3]
I’m still not sure how to navigate a world with lots of extremely strong AIs. As capability increases, outcome variance increases. With no mitigations, more and more organizations (or, eventually, individuals) will have access to destabilizing systems, which would amplify any hostile competitive dynamics.[4] The “pivotal act” frame gets imported even if none of the systems are independently dangerous.
I’ve got hope that my expected path of capabilities opens the door for more incremental interventions, but there’s a reason my total P(doom) hasn’t yet dropped much below 30%.
[1] The reason this isn’t an update for me is that I was being deliberately conservative at the time.
[2] A hypercomputer-empowered optimizer can jump straight to the global optimum by brute force. There isn’t some mild greedy search to be incrementally shaped; if your specification is even slightly wrong in a sufficiently complex space, the natural and default result of a hypercomputer-optimizer is infinite cosmic horror.
[3] It’s sometimes tricky to draw a line between “a technical alignment failure yielded an AI-derived catastrophe” and “someone used it wrong,” so it’s hard to pin down the constituent probabilities.
[4] While strong AI introduces all sorts of new threats, its generality amplifies “conventional” threats like war, nukes, and biorisk, too. This could create civilizational problems even before a single AI could, in principle, disempower humanity.