I don’t think value alignment of a super-takeover AI would be a good idea, for the following reasons:
1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.
2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it’s very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can’t correct for externalities that emerge down the road. (Speed also makes it more likely that we can’t correct in time, so I think we should try to go slow.)
3) There is no agreement on which values are ‘correct’. Personally, I’m a moral relativist, meaning I don’t believe in moral facts. Although perhaps a niche position among rationalists and EAs, I think a fair number of humans share my view. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would still matter which values are implemented). It’s very uncertain whether any surviving humans would consider such change net positive.
4) If one thinks that consciousness implies moral relevance, that AIs will be conscious, that creating more happy morally relevant beings is morally good (as MacAskill defends), and that AIs are more efficient than humans and other animals, then the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.
I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I’m somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI’s input.
Offense/defense balance is such a giant crux for me. I would take quite different actions if I saw plausible arguments that defense will win over offense. I’m astonished that I don’t know of any literature on this. Large parts of the space seem quite strongly convinced that either offense or defense will win (at least, their actions don’t make sense to me otherwise), but I’ve very rarely seen this assumption debated explicitly. It would really be very helpful if someone could point me to sources. Right now I have a Twitter poll with 30 votes (result: offense wins) and an old LW post to go by.
I think that if government involvement suddenly increases, there will also be a window of opportunity to get an AI safety treaty passed. I feel a government-focused plan should include pushing for this.
(I think heightened public xrisk awareness is also likely in such a scenario, making the treaty more achievable. I also think heightened awareness in both government and public would make short treaty timelines (weeks to a year), at least between the US and China, realistic.)
Our treaty proposal (a few other good ones exist): https://time.com/7171432/conditional-ai-safety-treaty-trump/
Also, I think endgames should be made explicit: what are we going to do once we have aligned ASI? I think that’s true both for Marius’ plan and for a government-focused plan with a Manhattan- or CERN-style project included in it.
I think this is a crucial question that has been on my mind a lot, and I feel it’s not adequately discussed in the xrisk community, so thanks for writing this!
While I’m interested in what people would do once they have an aligned ASI, what matters in the end is what labs would do, and what governments would do, because they are the ones who would make the call. Do we have any indications on that? What I would expect without thinking very deeply about it: labs wouldn’t try to block others. It’s risky, probably illegal and generally none of their business. They would try to make sure they are not blowing up the world themselves but otherwise let others solve this problem. Governments on the other hand would attempt to block other states from building super-takeover AI, since it’s generally their business to maintain power. I’m less sure they would also block their own citizens from building super-takeover AI, but leaning towards a yes.
Also two smaller points:
You’re pointing to universal surveillance as an (undesirable) way to enforce a pause. I think it’s not obvious that this way is best. My current guess is that hardware regulation has a better chance, even in a world with significant algorithmic and hardware improvement.
I think LWers tend to invoke nuclear warfare too easily. In the real world, almost eighty years of all kinds of conflicts have not resulted in nuclear escalation. It’s unlikely that a software attack on a datacenter would.
Thanks for writing the post, it was insightful to me.
“This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year”
In your mind, what would be the best case outcome of such “alignment and other safety research”? What would it achieve?
I’m expecting something like “solve the alignment problem”. I’m also expecting you to think this might mean that advanced AI would be intent-aligned, that is, it would try to do what a user wants it to do, while not taking over the world. Is that broadly correct?
If so, the biggest missing piece for me is to understand how this would help to avoid that someone else builds an unaligned AI somewhere else with sufficient capabilities to take over. DeepSeek released a model with roughly comparable capabilities nine weeks after OpenAI’s o1, probably without stealing weights. It seems to me that you have about nine weeks to make sure others don’t build an unsafe AI. What’s your plan to achieve that and how would the alignment and other safety research help?
AI is getting human-level at closed-ended tasks such as math and programming, but not yet at open-ended ones. They appear to be more difficult. Perhaps evolution brute-forced open-ended tasks by creating lots of agents. In a chaotic world, we’re never going to know which actions lead to a final goal, e.g. GDP growth. That’s why lots of people try lots of different things.
Perhaps the only way in which AI can achieve ambitious final goals is by employing lots of slightly diverse agents. Perhaps that would almost inevitably lead to many warning shots before a successful takeover?
I don’t have strong takes on what exactly is happening in this particular case but I agree that companies (and more generally, people at high-pressure positions) are very frequently doing the kind of thing you describe. I don’t think we have an indication that this would not be valid for leading AI labs as well.
Re the o1 AIME accuracy vs. test-time compute scaling graphs: I think it’s crucial to understand that the test-time compute x-axis is likely wildly different from the train-time compute x-axis. You can throw tens to hundreds of millions of dollars at train-time compute once and still run a company. You can’t spend that on test-time compute for every single call. The per-call test-time compute budget, if things are to stay anywhere near commercially viable, needs to be perhaps eight OOMs below train-time compute. Calling anything happening there a “scaling law” is a stretch of the term (very helpful for fundraising) and at best valid very locally.
If RL is actually happening at a compute scale beyond tens of millions of dollars, and this gives much better final results than doing the same at a smaller scale, that would change my mind. Until then, I think scaling in any meaningful sense of the word is not what drives capabilities forward at the moment; algorithmic improvement is. And this is not just coming from the currently leading labs (as can be seen e.g. here and here).
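To make the per-call arithmetic above concrete, here is a minimal back-of-envelope sketch. The specific numbers (a ~$100M training run, a ~$1 viable compute budget per query) are illustrative assumptions of mine, not figures from any lab:

```python
import math

# Back-of-envelope comparison of train-time vs. per-call test-time compute budgets.
# All figures below are illustrative assumptions, not measured numbers.
TRAIN_BUDGET_USD = 100e6      # assumed one-off training-run budget (~$100M)
PER_QUERY_BUDGET_USD = 1.0    # assumed commercially viable compute spend per query

# Gap in orders of magnitude between the one-off training budget
# and what can be spent on a single call at inference time.
gap_ooms = math.log10(TRAIN_BUDGET_USD / PER_QUERY_BUDGET_USD)
print(f"Train-time vs. per-call test-time budget gap: ~{gap_ooms:.0f} OOMs")  # ~8 OOMs
```

Under these assumptions the per-call gap comes out to roughly eight OOMs, which is the sense in which extrapolating a test-time “scaling law” onto the train-time axis seems like a stretch to me.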
Thanks for the offer, we’ll do that!
Not publicly, yet. We’re working on a paper providing more details about the conditional AI safety treaty. We’ll probably also write a post about it on lesswrong when that’s ready.
I’m aware and I don’t disagree. However, in xrisk, many (not all) of those who are most worried are also most bullish about capabilities. Conversely, many (not all) of those who are not worried are unimpressed with capabilities. Being aware of the concept of AGI, that it may be coming soon, and of how impactful it could be, is in practice often a first step towards becoming concerned about the risks, too. Unfortunately, this is not true for everyone. Still, I would say that, at least for our chances of getting an international treaty passed, it is perhaps hopeful that the power of AGI is on the radar of leading politicians (although this may also increase risk through other paths).
Proposing the Conditional AI Safety Treaty (linkpost TIME)
The recordings of our event are now online!
Announcing the AI Safety Summit Talks with Yoshua Bengio
My current main cruxes:
Will AI get takeover capability? When?
Single ASI or many AGIs?
Will we solve technical alignment?
Value alignment, intent alignment, or CEV?
Defense>offense or offense>defense?
Is a long-term pause achievable?
If there is reasonable consensus on any one of these, I’d much appreciate hearing about it. Otherwise, I think they should be research priorities.
When we decided to attach moral weight to consciousness, did we have a comparable definition of what consciousness means or was it very different?
AI takeovers are probably a rich field. There are partial and full takeovers, reversible and irreversible takeovers, aligned and unaligned ones. While all takeovers seem bad to me, some could be a lot worse than others. Thinking through specific ways a takeover could happen could provide clues on how to increase the chance that it does not. In comms as well, takeovers are a neglected and important subtopic.
I updated a bit after reading all the comments. It seems that Christiano’s threat model, or in any case the threat model of most others who interpret his writing, is about more powerful AIs than I initially thought. The AIs would already be superhuman, but for whatever reason, a takeover has not occurred yet. Also, we would apply them in many powerful positions (heads of state, CEOs, etc.).
I agree that if we end up in this scenario, all the AIs working together could potentially cause human extinction, either deliberately (as some commenters think) or as a side-effect (as others think).
I still don’t think that this is likely to cause human extinction, though, mostly for the following reasons:
- I don’t think these AIs would _all_ act against human interests. We would employ a CEO AI, but also a journalist AI to criticize the CEO AI. If the CEO AI decides to let its factory consume oxygen to such an extent that humanity suffers from it, that’s a great story for the journalist AI. Then a policymaker AI would make policy against this. More generally: I think it’s a significant mistake in the WFLL threat models that AI actions are assumed to be correlated towards human extinction. If we humans deliberately put AIs in charge of important parts of our society, they will be good at running their shop but as misaligned with each other (thereby keeping a power balance) as humans currently are. I think this power balance is crucial and may very well prevent things from going very wrong. Even under distributional shift, I think the power balance is likely robust enough to prevent an outcome as bad as human extinction. Currently, some humans’ job is to make sure things don’t go very wrong. If we automate them, we will have AIs trying to do the same. (And since we deliberately put them in this position, they will be aligned with human interests, as opposed to us being aligned with chimpanzee interests.)
- This is a very gradual process, where many steps need to be taken: AGI must be invented, trained, pass tests, be marketed, be deployed, likely face regulation, be adjusted, and be deployed again. During all those steps, we have opportunities to do something about any threats that turn out to exist. This threat model can be regulated in a trial-and-error fashion, which humans are good at and our institutions are accustomed to (as opposed to the Yudkowsky/Bostrom threat model).
- Given that current public existential risk awareness, according to our research, is already ~19%, and given that existential risk concern and awareness levels tend to follow tech capability, I think awareness of this threat will be near-universal before it could happen. At that moment, I think we will very likely regulate existentially dangerous use cases.
In terms of solutions:
- I still don’t see how solving the technical part of the alignment problem (making an AI reliably do what anyone wants) contributes to reducing this threat model. If AI cannot reliably do what anyone wants, it will not be deployed in a powerful position, and therefore this model will not get a chance to occur. In fact, working on technical alignment will enormously increase the chance that AI will be employed in powerful positions, and will therefore increase existential risk as caused by the WFLL threat model (although, depending on the pivotal act and the offense/defense balance, solving alignment may decrease existential risk from the Yudkowsky/Bostrom takeover model).
- An exception to this could be to make an AI reliably do what ‘humanity wants’ (using some preference aggregation method), and making it auto-adjust for shifting goals and circumstances. I can see how such work reduces this risk.
- I still think traditional policy, after technology invention and at the point of application (similar to e.g. the EU AI Act) is the most useful regulation to reduce this threat model. Specific regulation at training could be useful, but does not seem strictly required for this threat model (as opposed to in the Yudkowsky/Bostrom takeover model).
- If one wants to reduce this risk, I think increasing public awareness is crucial. High risk awareness should enormously increase public pressure to either not deploy AI in powerful positions at all, or to demand very strong, long-term, and robust alignment guarantees, either of which would reduce risk.
In terms of timing, although working on this threat model now is likely net positive, it doesn’t seem absolutely crucial to me. Once we actually have AGI, including situational awareness, long-term planning, an adaptable world model, and agentic actions (which could still take a long time), we are likely still in time to regulate use cases (again as opposed to the Yudkowsky/Bostrom takeover model, where we need to regulate/align/pause ahead of training).
After my update, I still think the chance that this threat model leads to an existential event is small and that work on it is not super urgent. However, I’m now less confident in giving an upper-bound risk estimate.
I want to stress how much I like this post. What to do once we have an aligned AI of takeover level, or how to make sure no one builds an unaligned AI of takeover level, is in my opinion the biggest gap in many AI plans. I think answering this question might point to gaps that are currently completely unaddressed, and I therefore really like this discussion. I previously tried to contribute to arguably the same question in this post, where I argue that a pivotal act seems unlikely and therefore conclude that policy rather than alignment is what will likely make sure we don’t go extinct.
I would say this is a pivotal act, although I like the sound of enforcing a moratorium better (and the opening it perhaps gives to enforcing a moratorium in the traditional, imo much preferred way of international policy).
I’m hereby providing a few reasons why I think a pivotal act might not happen:
A pivotal act is illegal. One needs to break into other people’s and other countries’ computer systems and do physical harm to property or possibly even people to enact it. Companies such as OpenAI and Anthropic are, although I’m not always a fan of them, generally law-abiding. It will be a big step for their leadership to do something as blatantly unlawful as a pivotal act.
There is zero indication that labs are planning to do a pivotal act. This may obviously have something to do with the point above; however, one would have expected hints from someone like Sam Altman, who hints at things all the time, or leaks from people lower down in the labs, if they were planning to do this.
The pivotal act is currently not even discussed seriously among experts, and it is in fact highly unpopular in the discourse (see for example here).
If the labs are currently not planning to do this, it seems quite likely they won’t when the time comes.
Governments, especially the US government/military, seem more likely in my opinion to perform a pivotal act. I’m not sure they would call it a pivotal act or necessarily have an existential reason in mind while performing it. They might see it as blocking adversaries from being able to attack the US, which is very much within their Overton window. However, for them as well, there is no certainty they would actually do this. There are large downsides: it is a hostile act towards another country, it could trigger conflict, they are likely to be uncertain how necessary it is at all, and uncertain about the progress of an adversary’s project (perhaps underestimating it). For perhaps similar reasons, the US did not block the USSR’s atomic project before the USSR had the bomb, even though this could arguably have preserved a unipolar instead of multipolar world order. Additionally, it is far from certain the US government will nationalize labs before they reach takeover level. Currently, there is little indication it will. I think it’s unreasonable to place more than, say, 80% confidence in the US government or military successfully blocking all adversaries’ projects before they reach takeover level.
I think it’s not unlikely that once an AI is powerful enough for a pivotal act, it will also be powerful enough to generally enforce hegemony, and not unlikely that this would be persistent. I would be strongly against one country, or even one lab, proclaiming and enforcing global hegemony for eternity. The risk that this might happen is a valid reason to support a pause, imo. If we get that lucky, I would much prefer a favorable offense-defense balance and many actors having AGI, while maintaining a power balance.
I think it’s too early to contribute to aligned ASI projects (Manhattan/CERN/Apollo/MAGIC/commercial/govt projects) as long as these questions are not resolved. For the moment, pushing for e.g. a conditional AI safety treaty is much more prudent, imo.