Suppose the US government pursued a “Manhattan Project for AGI”. At its onset, it’s primarily fuelled by a desire to beat China to AGI. However, there’s some chance that its motivation shifts over time (e.g., if the government ends up thinking that misalignment risks are a big deal, its approach to AGI might change.)
Do you think this would be (a) better than the current situation, (b) worse than the current situation, or (c) it depends on XYZ factors?
My own impression is that this would be an improvement over the status quo. Main reasons:
A lot of my P(doom) comes from race dynamics.
Right now, if a leading lab ends up realizing that misalignment risks are super concerning, they can’t do much to end the race. Their main strategy would be to go to the USG.
If the USG runs the Manhattan Project (or there’s some sort of soft nationalization in which the government ends up having a much stronger role), it’s much easier for the USG to see that misalignment risks are concerning & to do something about it.
A national project would be more able to slow down and pursue various kinds of international agreements (the national project has more access to POTUS, DoD, NSC, Congress, etc.)
I expect the USG to be stricter on various security standards. It seems more likely to me that the USG would, e.g., demand a lot of security requirements to prevent model weights or algorithmic insights from leaking to China. One of my major concerns is that people will want to pause at GPT-X but they won’t feel able to because China stole access to GPT-(X-1) (or maybe even a slightly weaker version of GPT-X).
In general, I feel like USG natsec folks are less “move fast and break things” than folks in SF. While I do think some of the AGI companies have tried to be less “move fast and break things” than the average company, I think corporate race dynamics & the general cultural forces have been the dominant factors and undermined a lot of attempts at meaningful corporate governance.
(Caveat that even though I see this as a likely improvement over status quo, this doesn’t mean I think this is the best thing to be advocating for.)
(Second caveat that I haven’t thought about this particular question very much and I could definitely be wrong & see a lot of reasonable counterarguments.)
As you know, I have huge respect for USG natsec folks. But there are (at least!) two flavors of them: 1) the cautious, measure-twice-cut-once sort that have carefully managed deterrence for decades, and 2) the “fuck you, I’m doing Iran-Contra” folks. Which do you expect will get in control of such a program? It’s not immediately clear to me which ones would.
@davekasten I know you posed this question to us, but I’ll throw it back on you :) what’s your best-guess answer?
Or perhaps put differently: What do you think are the factors that typically influence whether the cautious folks or the non-cautious folks end up in charge? Are there any historical or recent examples of these camps fighting for power over an important operation?
Why is the built-in assumption for almost every single post on this site that alignment is impossible and we need a 100 year international ban to survive? This does not seem particularly intellectually honest to me. It is very possible no international agreement is needed. Alignment may turn out to be quite tractable.
A mere 5% chance that the plane will crash during your flight is consistent with considering this extremely concerning and doing anything in your power to avoid getting on it. “Alignment is impossible” is not necessary for great concern, and isn’t implied by it.
I don’t think this line of argument is a good one. If there’s a 5% chance of x-risk and, say, a 50% chance that AGI makes the world just generally be very chaotic and high-stakes over the next few decades, then it seems very plausible that you should mostly be optimizing for making the 50% go well rather than the 5%.
Still consistent with great concern. I’m pointing out that O O’s point isn’t locally valid: observing concern shouldn’t translate into observing a belief that alignment is impossible.
Yudkowsky has a pinned tweet that states the problem quite well: it’s not so much that alignment is necessarily infinitely difficult, but that it certainly doesn’t seem anywhere near as easy as advancing capabilities, and that’s a problem when what matters is whether the first powerful AI is aligned:
“Safely aligning a powerful AI will be said to be ‘difficult’ if that work takes two years longer or 50% more serial time, whichever is less, compared to the work of building a powerful AI without trying to safely align it.”
Another frame: If alignment turns out to be easy, then the default trajectory seems fine (at least from an alignment POV; you might still be worried about, e.g., concentration of power).
If alignment turns out to be hard, then the policy decisions we make to affect the default trajectory matter a lot more.
This means that even if misalignment risks are relatively low, a lot of value still comes from thinking about worlds where misalignment is hard (or perhaps “somewhat hard but not intractably hard”).
It’s not every post, but there are still a lot of people who think that alignment is very hard.
The more common assumption is that we should assume that alignment isn’t trivial, because an intellectually honest assessment of the range of opinions suggests that we collectively do not yet know how hard alignment will be.
If the project were fueled by a desire to beat China, its structure seems unlikely to resemble the parts of the original Manhattan Project’s structure that seemed maybe advantageous here, like having a single government-controlled, centralized R&D effort.
My guess is that if something like this actually happens, it would involve a large number of industry subsidies, and would create strong institutional momentum to push the state of the art forward even when things got dangerous, and, insofar as there is pushback, to continue dangerous development in secret.
In the case of nuclear weapons, the U.S. really went very far on the advice of Edward Teller, so I think the outside view here really doesn’t look good.
Good points. Suppose you were on a USG taskforce that had concluded they wanted to go with the “subsidy model”, but they were willing to ask for certain concessions from industry.
Are there any concessions/arrangements that you would advocate for? Are there any ways to do the “subsidy model” well, or do you think the model is destined to fail even if there were a lot of flexibility RE how to implement it?
I think “full visibility” seems like the obvious thing to ask for, and something that could maybe improve things. Also, preventing you from selling your products to the public, and basically forcing you to sell your most powerful models only to the government, gives the government more ability to stop things if it comes to that.
I will think more about this, I don’t have any immediate great ideas.
If you could only have “partial visibility”, what are some of the things you would most want the government to be able to know?
I have an answer to that: making sure that NIST’s AISI had, at a minimum, scores from automated evals on checkpoints of any new large training runs, as well as pre-deployment eval access.
Seems like a pretty low-cost, high-value ask to me. Even if that info leaked from AISI, it wouldn’t give away corporate algorithmic secrets.
A higher-cost ask, but still fairly reasonable, is pre-deployment evals which require fine-tuning. You can’t have a good sense of what the model would be capable of in the hands of bad actors if you don’t test fine-tuning it on hazardous info.
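To make the lower-cost ask concrete, here is a minimal, purely illustrative sketch of what sharing per-checkpoint automated eval scores could look like mechanically. The eval names, thresholds, and report format are hypothetical assumptions for illustration, not any lab’s or AISI’s actual pipeline.

```python
# Hypothetical sketch only: eval names, thresholds, and the report format are
# invented for illustration and do not reflect any real lab or AISI pipeline.
from dataclasses import dataclass, asdict
import json


@dataclass
class EvalResult:
    eval_name: str      # automated benchmark run on the checkpoint (hypothetical name)
    checkpoint_id: str  # which training checkpoint was evaluated
    score: float        # aggregate score on that benchmark
    threshold: float    # pre-agreed alert threshold for that benchmark


def build_checkpoint_report(results: list) -> str:
    """Bundle per-checkpoint eval scores into a JSON report that could be shared
    with an external body without exposing training code, data, or other
    algorithmic details."""
    report = {
        "checkpoints": sorted({r.checkpoint_id for r in results}),
        "results": [asdict(r) for r in results],
        "alerts": [asdict(r) for r in results if r.score >= r.threshold],
    }
    return json.dumps(report, indent=2)


if __name__ == "__main__":
    demo = [
        EvalResult("bio_uplift_proxy", "ckpt-001200", score=0.42, threshold=0.60),
        EvalResult("cyber_offense_proxy", "ckpt-001200", score=0.71, threshold=0.60),
    ]
    print(build_checkpoint_report(demo))
```

The point of the sketch is just that the shared artifact can be limited to scores and alert flags, which is why the leak risk to algorithmic secrets seems low.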
Worse than the current situation, because the counterfactual is that some later project happens which kicks off in a less race-y manner.
In other words, whatever the chance of its motivation shifting over time, it seems dominated by the chance that starting the equivalent project later would just have better motivations from the outset.
Can you say more about scenarios where you envision a later project happening that has different motivations?
I think in the current zeitgeist, such a project would almost definitely be primarily motivated by beating China. It doesn’t seem clear to me that it’s good to wait for a new zeitgeist. Reasons:
A company might develop AGI (or an AI system that is very good at AI R&D that can get to AGI) before a major zeitgeist change.
The longer we wait, the more capable the “most capable model that wasn’t secured” is. So we could risk getting into a scenario where people want to pause, but since China and the US both have GPT-(N-1), both sides feel compelled to race forward (whereas this wouldn’t have happened if security had kicked off sooner).
One factor is different incentives for decision-makers. The incentives (and the mindset) for tech companies are to move fast and break things. The incentives (and mindset) for government workers are usually vastly more conservative.
So if it is the government making decisions about when to test and deploy new systems, I think we’re probably far better off WRT caution.
That must be weighed against the government typically being very bad at technical matters. So even an attempt to be cautious could be thwarted by lack of technical understanding of risks.
Of course, the Trump administration is attempting to instill a vastly different mindset, more like that of tech companies. So if it’s that administration we’re talking about, we’re probably worse off on net, with a combination of lack of knowledge and YOLO attitudes. Which is unfortunate, because this is likely to happen anyway.
As Habryka and others have noted, it also depends on whether it reduces race dynamics by aggregating efforts across companies, or mostly just throws funding fuel on the race fire.
I think this is a (c) leaning (b), especially given that we’re doing it in public. Remember, the Manhattan Project was a highly classified effort, and we know it by an innocuous name given to it to avoid attention.
Saying publicly, “yo, China, we view this as an all-costs priority, hbu” is a great way to trigger a race with China...
But if it turned out that we knew from ironclad intel with perfect sourcing that China was already racing (I don’t expect this to be the case), then I would lean back more towards (c).
@davekasten @Zvi @habryka @Rob Bensinger @ryan_greenblatt @Buck @tlevin @Richard_Ngo @Daniel Kokotajlo I suspect you might have interesting thoughts on this. (Feel free to ignore though.)
(c). Like if this actually results in them behaving responsibly later, then it was all worth it.
What do you think are the most important factors for determining if it results in them behaving responsibly later?
For instance, if you were in charge of designing the AI Manhattan Project, are there certain things you would do to try to increase the probability that it leads to the USG “behaving more responsibly later?”
Some thoughts:
The correct answer is clearly (c) - it depends on a bunch of factors.
My current guess is that it would make things worse (given likely values for the bunch of other factors) - basically for Richard’s reasons.
Given [new potential-to-shift-motivation information/understanding], I expect there’s a much higher chance that this substantially changes the direction of a not-yet-formed project than of a project already in motion.
Specifically:
Who gets picked to run such a project? If it’s primarily a [let’s beat China!] project, are the key people cautious and highly adaptable when it comes to top-level goals? Do they appoint deputies who’re cautious and highly adaptable?
Here I note that the kind of ‘caution’ we’d need is [people who push effectively for the system to operate with caution]. Most people who want caution are more cautious.
How is the project structured? Will the structure be optimized for adaptability? For red-teaming of top-level goals?
Suppose that a mid-to-high-level participant receives information making the current top-level goals questionable—is the setup likely to reward them for pushing for changes? (noting that these are the kind of changes that were not expected to be needed when the project launched)
Which external advisors do leaders of the project develop relationships with? What would trigger these to change?
...
I do think that it makes sense to aim for some centralized project—but only if it’s the right kind.
I expect that almost all the directional influence is in [influence the initial conditions].
For this reason, I expect [push for some kind of centralized project, and hope it changes later] is a bad idea.
I think [devote great effort to influencing the likely initial direction of any such future project] seems a great idea (so long as you’re sufficiently enlightened about desirable initial directions, of course :))
I’d note that [initial conditions] needn’t only be internal to the project—in principle we could have reason to believe that various external mechanisms would be likely to shift the project’s motivation sufficiently over time. (I don’t know of any such reasons)
I think the question becomes significantly harder once the primary motivation behind a project isn’t [let’s beat China!], but also isn’t [your ideal project motivation (with your ideal initial conditions)].
I note that my p(doom) doesn’t change much if we eliminate racing but don’t slow down until it’s clear to most decision makers that it’s necessary.
Likewise, I don’t expect that [focus on avoiding the earliest disasters] is likely to be the best strategy. So e.g. getting into a good position on security seems great, all else equal—but I wouldn’t sacrifice much in terms of [odds of getting to a sufficiently cautious overall strategy] to achieve better short-term security outcomes.
One thing I’d be bearish on is visibility into the latest methods being used for frontier AI systems, which would in turn reduce the relevance of alignment research outside the Manhattan-like project itself. This is already somewhat true of the big labs, e.g., the methods used for o1-like models. However, there is still some visibility in the form of system cards and reports which hint at the methods. When the primary intention is racing ahead of China, I doubt there will be reports discussing the methods used for frontier systems.
Depends on the direction/magnitude of the shift!
I’m currently feeling very uncertain about the relative costs and benefits of centralization in general. I used to be more into the idea of a national project that centralized domestic projects and thus reduced domestic racing dynamics (and arguably better aligned incentives), but now I’m nervous about the secrecy it would likely entail, and think it’s less clear that a non-centralized situation inevitably leads to a decisive strategic advantage for the leading project. Which is to say, even under pretty optimistic assumptions about how much such a project invests in alignment, security, and benefit-sharing, I’m pretty uncertain that this would be good, and with more realistic assumptions I probably lean towards it being bad. But it super depends on the governance, the wider context, how a “Manhattan Project” would affect domestic companies and China’s policymaking, etc.
(I think a great start would be not naming it after the Manhattan Project, though. It seems path dependent, and that’s not a great first step.)
Can you say more about what has contributed to this update?
Something I’m worried about now is some RFK Jr/Dr. Oz equivalent being picked to lead on AI...