Why I think AI will go poorly for humanity

Epistemic status:
This document is a distillation of many conversations I’ve had over the past couple of months about how the “AI situation” will progress. I think my claims are mostly right, and where they aren’t, the high-level claims rest on a large enough disjunction of smaller claims that errors in any particular claim don’t change the truth value of the disjunctions. However, there are almost surely considerations I’m missing. Part of my motivation for sharing this is the hope that readers will help me identify considerations I’m missing, or problems with my thinking about this.
Notes:
This document is poorly written in places, but I don’t think this impedes understanding too much. Please let me know if I’m wrong about that.
This document was intended to be readable by a general audience who is only broadly familiar with AI, so it contains a lot of basic background on the problem.
Acknowledgements:
I’m grateful to Kevin, Tarushii, Brandon Westover, Brendan Halstead, and many others for conversations about AI that have helped me refine my views on this.
Main claim of this document:
Pr[human extinction in next 15 years] ≈ 90%.
The majority of this document will argue that the risk from AI is large. But at the start I’ll list a few counterarguments that I find moderately compelling for why we might be fine:
Nations will adopt the MAIM strategy
Models will generalize goodness by luck
A war might stop AI progress
AIs will solve alignment for us
As noted earlier, I’d love people to expand this list in the comments!
Introduction
This document presents a dialogue between two imaginary characters: Bob and Alice. Bob is an AI x-risk skeptic who is talking in good faith to his friend Alice, who has thought about AI x-risk in depth. Alice will give arguments for why she believes AI poses a substantial x-risk, and rebuttals to Bob’s skeptical claims that the risk is fake or minimal.
Alice: Hi Bob.
Bob: Hi Alice.
Alice:
Today I’d like to talk about the danger posed to humanity by the development of superintelligent AI.
Bob:
Oh is that a movie that you were watching recently?
Alice:
No, I mean in real life.
Bob:
Oh hmm, that’s a pretty bold claim. I have several reasons why I’m skeptical right off the bat, before even hearing your argument:
Most of the time when someone makes a claim about the world ending, it’s a conspiracy theory.
The world feels very continuous—if there were going to be some drastic change like this, I think I’d have noticed.
Computers just do what we tell them to! It’s not like they have their own goals!
If an AI is evil, then we’ll just turn it off!
Evil AI is a made up science fiction notion.
Even if an AI was evil I don’t think it could cause that much harm: there are lots of evil humans, and they don’t cause too much harm.
If AI did pose a risk to humanity then people wouldn’t be working on building it.
If AI did pose such a risk, then there would be expert consensus about the issue.
So, it’d take some pretty strong reasons and evidence to convince me of this claim. But I think you’re pretty thoughtful, so I’m quite curious to hear you out on why you’re worried.
Alice:
The objections you’ve mentioned are common first reactions to hearing about this issue! I think many of them are reasonable heuristics, which happen to be wrong in this case.
Let’s do the following:
I’ll start by explaining why the current trajectory of AI development is so dangerous.
Then I’ll respond to your above objections and any new objections that you have after hearing my argument.
We then iterate this process until we reach consensus.
Bob: ok, I’m ready.
Alice:
To be clear, here’s the claim I’ll argue for in the rest of the discussion:
Claim 1:
Pr(Humanity goes extinct because of AI in the next 15 years) > 1⁄2
Alice:
Note that I believe the situation is more dire than this, and that the danger comes sooner than 15 years, but getting more people on board with claim 1 seems like it could improve the situation, and so I’m choosing to focus on this.
My argument is factored into 4 steps; please use the links to navigate to whichever part of the argument you’re most interested in.
LEMMAS
[[Lemma 1]]: If people keep trying hard to make ASI, they are likely to succeed, soon.
[[Lemma 2]]: ASI would be capable of destroying humanity.
[[Lemma 3]]: ASIs are likely to have goals which would drive them to destroy humanity.
[[Lemma 4]]: People are likely to try really hard to make ASI.
[[Addressing some more common counterarguments]]
Lemma 1: ASI is possible soon, if people try hard
Before arguing for lemma 1, we need to define ASI.
I’ll define an ASI to be an AI with a critical score on the OpenAI Preparedness Framework.
Namely: an AI with superhuman abilities at hacking, autonomy, persuasion, and CBRN.
More specifically I will require that the AI can do most of these things:
Can do biology research — e.g., figure out how to synthesize novel proteins with a desired function
Can do strategy — e.g., military strategy, or running a company
Can “survive on its own in the wild,” e.g., make enough money online to purchase GPUs that it can run inference on
Can hack well enough to escalate its privileges to root / to escape the lab somehow (or it can do this via social engineering)
I’ll now explain why, if AI corporations don’t drastically alter their trajectory, I expect them to succeed in making ASI in the next couple of years. You might think “AI corporations don’t drastically alter their trajectories” is a pretty unreasonable assumption: if they feel like they are close to making something dangerous, wouldn’t they get spooked and stop? I explain why an AI slowdown is unlikely in [[Lemma 4]]. In this section, I’ll assume that people keep trying as hard as they are right now to make ASI, and see what the world looks like if this continues.
Argument 1 for Lemma 1:
There are no major bottlenecks to scaling up systems to become more powerful.
And if there were bottlenecks, there is such an abundance of talent and money being thrown at the problem that people would likely find a way around them.
There are four major ingredients that go into making powerful models, which could potentially bottleneck progress:
Energy, data, GPU chips, and money.
Some people have crunched the numbers on this, and we still have substantial room to grow in all four areas. There is also a trend of cost decreasing rapidly over time; e.g., o3-mini is comparable to or better than o1 at much reduced cost. If there is going to be some big bottleneck (e.g., energy), I expect labs to be proactive about it (e.g., build nuclear power plants).
AI companies are already being proactive about building the infrastructure they need to scale up; see Project Stargate.
Argument 2 for Lemma 1:
Recent progress has been rapid; it can continue.
GPT-3 -- roughly a high schooler
GPT-4 -- roughly a college student
o3 -- comparable with CS/math olympiad competitors on closed-ended, ~6-hour tests; quite good at SWE
Of course, now that these systems exist, people are finding lots of reasons to claim that they aren’t impressive. We humans grow accustomed to things remarkably quickly. In order to notice that you are surprised, you must make predictions and have them be falsified.
A misconception that I find baffling is that AI capabilities will stop at human level. People say a lot of words about training data, and seem to neglect the fact that there are known methods by which AIs can bootstrap to superhuman performance on tasks (namely, RL). For instance, if you want to imagine how you could get an AI that is better than humans at math, here’s how you could do it (a toy sketch follows this list):
Start by training it to solve some easy math problems, maybe by imitating humans.
Once the AI is good at that, slowly increase the difficulty, and let the AI get really good at the new difficulty level.
The AI does not need humans to score its performance: it can just check the answers. This provides a nice signal for RL.
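To make this concrete, here is a toy, self-contained sketch of that loop. Everything in it (the ToyModel, its fake learning dynamics, the thresholds) is invented purely for illustration; it is not any lab’s actual training code, just the shape of “RL with automatic answer checking plus a rising difficulty curriculum.”

```python
# Toy sketch of the curriculum-RL recipe described above. Everything here
# (ToyModel, the fake learning dynamics, the thresholds) is invented for
# illustration; it is not any lab's actual training code.
import random
from collections import deque

class ToyModel:
    """Stand-in for a trainable policy; 'skill' rises as correct answers are reinforced."""
    def __init__(self):
        self.skill = 0.0

    def try_solve(self, problem):
        # Pretend dynamics: always solves problems at or below its skill level,
        # and occasionally solves harder ones (exploration).
        p_correct = 1.0 if problem["difficulty"] <= self.skill else 0.3
        return problem["solution"] if random.random() < p_correct else None

    def reinforce(self, reward):
        # Placeholder for a real RL update (e.g., a policy-gradient step).
        self.skill += 0.01 * reward

def make_problem(difficulty):
    # Harder levels mean bigger numbers; the key point is that the answer is checkable.
    a = random.randint(1, 10 ** (difficulty + 1))
    b = random.randint(1, 10 ** (difficulty + 1))
    return {"difficulty": difficulty, "question": f"{a} + {b}", "solution": a + b}

model, difficulty = ToyModel(), 0
recent = deque(maxlen=100)
for step in range(5000):
    problem = make_problem(difficulty)
    proposed = model.try_solve(problem)
    reward = 1.0 if proposed == problem["solution"] else 0.0  # checked automatically, no human grader
    model.reinforce(reward)
    recent.append(reward)
    # Curriculum: once the model masters the current level, raise the difficulty.
    if len(recent) == recent.maxlen and sum(recent) / len(recent) > 0.8:
        difficulty += 1
        recent.clear()

print(f"final difficulty reached: {difficulty}")
```

The point of the sketch is only the structure: reward comes from a programmatic check rather than a human grader, and the difficulty rises whenever the recent success rate is high, so there is no obvious ceiling at human level.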
Some people argue that ASI is impossible because “computers can’t be intelligent”, or “humans are special, so we’ll always be able to do some things better than AIs”. I find Tegmark’s analogy helpful here: “the universe doesn’t check if the heat-seeking missile is sentient before having it explode and kill you”. It seems pretty clear that intelligence is just a bunch of hacks. Nature was able to evolve humans to be smart; why could RL not do the same for AIs?
Argument 3 for Lemma 1
It seems likely, though not essential for the argument, that AIs will soon be able to speed up AI research a lot. In some sense, they are already doing this: e.g., if you use an AI code-completion tool, or ask an AI to make a plot for you, that’s speeding things up. A lot of R&D is “grunt work” / implementing things (as far as I know). I think the amount by which AIs could speed up research is substantial. Here’s a framework for how this could happen:
For instance, soon I believe that you’ll be able to tell an AI to go do some experiment and it can just go do it.
Not long after this, I think you can ask the AI “what would be good experiments to do” and it’ll give good suggestions.
At this point, you don’t need as many humans in the loop relative to the amount of research going on: humans can give high-level objectives to the AI, and the AI can make and execute a research program to achieve those goals.
Appeal to authority on Lemma 1
I find it fairly compelling that many AI experts, both academics and people in industry, believe AI progress will be fast. Here are some quotes.
Yoshua Bengio: “Because the situation we’re in now is that most of the experts in the field think that sometime, within probably the next 20 years, we’re going to develop AIs that are smarter than people. And that’s a very scary thought.”
Sam Altman “It is possible that we will have superintelligence in a few thousand days (!); it may take longer, but I’m confident we’ll get there.”
Sam: “We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies.”
Elon Musk said, in an interview with the Norwegian fund manager Nicolai Tangen last year, that AGI that is “smarter than the smartest human” will be available in 2025 or by 2026.
Dario Amodei:
“It depends on the thresholds. In terms of someone looks at the model and even if you talk to it for an hour or so, it’s basically like a generally well educated human, that could be not very far away at all. I think that could happen in two or three years.
The main thing that would stop it would be if we hit certain safety thresholds and stuff like that. So if a company or the industry decides to slow down or we’re able to get the government to institute restrictions that moderate the rate of progress for safety reasons, that would be the main reason it wouldn’t happen. But if you just look at the logistical and economic ability to scale, we’re not very far at all from that.”
Nvidia CEO Jensen Huang:
“AI will be ‘fairly competitive’ with humans in 5 years”
Bob:
People in industry have a COI for claiming powerful AI soon: it could get people excited about AI and drive up investment.
People in AI safety have a COI for claiming powerful AI soon: it could get them more funding.
Alice:
I don’t think the expert opinions are conclusive on their own, just another piece of evidence that makes the situation look bad.
Some reasons why I’m inclined to take these opinions more seriously than you do:
If timelines aren’t as short as these people say, their reputations will suffer.
People like Geoff Hinton are willing to do hard things, like quitting their jobs and saying they regret their life’s work, in order to talk about this.
Lemma 2: ASI could destroy humanity
Bob: How will the AIs take over?
Alice:
I’m not sure but a smart AI will be able to come up with something. In case it helps your intuition, here are a couple plausible takeover paths:
Military Integration: Nations could place AI in charge of autonomous weapons (either of actually controlling them or developing software to control them). A misaligned ASI could use that authority to stage a coup. Alternatively, an AI could obtain control of drones via hacking. Note that AI companies are already partnering with defense companies to integrate AI into the military.
Biological Research: AIs put in charge of curing diseases could use their understanding of biology to craft a novel supervirus (much worse than COVID-19) to wipe out humanity. Note that AI has already had large successes in biology (e.g., AlphaFold).
Social Manipulation: An AI could craft a Deepfake of the president declaring nuclear war, triggering other nations to launch missiles, leading to a nuclear winter. (To cause a war might require a bit more work than this, but a manipulative AI could stoke tensions online and trick nations into war, possibly using hacking capabilities to make it look like the other side is attacking.)
There’s also an interesting scenario, described by Paul Christiano here, where we lose control more gradually, say over the course of 1-2 decades. One way this could play out is that we become highly reliant on AI systems, and the world changes rapidly and becomes extremely complicated, so that we don’t have any real hope of understanding it anymore. Maybe we end up only being able to interface with the world through AIs and thereby lose our agency.
But anyways, I view it as extremely obvious that ASI could destroy humanity.
So basically what I’m saying is: yes, we should invest resources into protecting against obvious threat models like synthetic biology, where it’s clear that an unaligned agent could do harm. But we shouldn’t feel too good about ourselves for preventing an AI from taking over using the methods that humans would try. We shouldn’t feel confident at all that an ASI couldn’t come up with some strategy we didn’t think of and exploit it. AIs do the unexpected all the time.
Bob: Well, what if we just “sandbox the AI really hard”?
Alice:
First of all, this seems pretty economically / politically infeasible. See [[Lemma 4]] for more discussion. But basically, if AI is super capable and we’ve thrown a bunch of money into creating it, the first thing people are going to do is widely deploy it. AI will be connected to the internet, talking to people, autonomously pursuing goals, and so on.
Bob: But suppose humanity wasn’t really that dumb.
Alice: Well, it’s better than the status quo but still definitely not sufficient once the AI gets highly capable.
If you build powerful AI, “not deploying it” doesn’t make it safe. It can escape the lab: either by hacking (if you even want to call it that; if we put the AI in charge of our security then it’s better described as collusion with other AIs / itself) or by social engineering (convincing a human employee to help it escape; bonus points if that human is a spy from another nation). This is kind of like making a super dangerous bioweapon and saying it’s fine because labs never accidentally leak viruses (they do). And anyways, people will use it inside the lab to do useful work, so it’s not clear how different this is from being deployed.
Appeal to authority on Lemma 2
Sam Altman “The bad case — and I think this is important to say — is like lights out for all of us.”—Jan 2023
Sam Altman “A misaligned superintelligent AGI could cause grievous harm to the world.”—Feb 2023
Sam Altman “Development of superhuman machine intelligence (SMI) is probably the greatest threat to the continued existence of humanity”
Also, I’m pretty confused about why Sam would lie point-blank about this quote:
Sen. Richard Blumenthal (D-CT):
I alluded in my opening remarks to the jobs issue, the economic effects on employment. I think you have said in fact, and I’m gonna quote, development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. End quote. You may have had in mind the effect on, on jobs, which is really my biggest nightmare in the long term. Let me ask you what your biggest nightmare is, and whether you share that concern,
Sam: Like with all technological revolutions, I expect there to be significant impact on jobs, but exactly what that impact looks like is very difficult to predict. If we went back to the other side of a previous technological revolution, talking about the jobs that exist on the other side you know, you can go back and read books of this. It’s what people said at the time. It’s difficult. I believe that there will be far greater jobs on the other side of this, and that the jobs of today will get better. I, I think it’s important. First of all, I think it’s important to understand and think about GPT-4 as a tool, not a creature, which is easy to get confused, and it’s a tool that people have a great deal of control over and how they use it. And second, GPT-4 and other systems like it are good at doing tasks, not jobs.
I don’t know much about Sam besides watching a few interviews. But the change in tone from 2015 to 2025 is pretty concerning. In any case such speculations are probably moot—it’s the action of building superintelligence that is the real danger.
Ilya Sutskever – Co-founder & Chief Scientist at Safe Superintelligence Inc.; Co-founder & former Chief Scientist, OpenAI. “It’s not that it’s going to actively hate humans and want to harm them, but it is going to be too powerful and I think a good analogy would be the way humans treat animals.”—Nov 2019
Bob: Well, but some people like Yann LeCun say there’s no risk.
Alice:
Well, he has a large conflict of interest (his job is making powerful AIs), so it’s not so surprising that he argues against the risk. Admitting the risk would be quite inconvenient for him (he might have to stop working on AI capabilities and instead focus on safety), and it would damage his pride, because he’s been so vocal that there are no risks. Another thing that discredits him is that his arguments are really bad.
Lemma 3: ASI will likely be misaligned
Alice:
There are lots of ways that building powerful AIs could go wrong. Often people focus on “misuse”: you have an AI that is well described as an “instruction follower”, and you tell it to do something evil. I’ll define a misuse issue as “you intended for something evil to happen, and then that evil thing happened”. If you told your AI “please go destroy Europe” and it did it, this counts as misuse. (Note that people have already tried telling AIs to do this, e.g., “ChaosGPT”, but it hasn’t worked out yet.)
I personally think that concerns around misuse are a compelling argument for not building powerful AI until we have better global coordination, but I won’t discuss this here. If nations think of AI as a powerful tool that grants them dominance, the whole “we must obtain this tool first” rhetoric makes sense, provided you take it as given that coordinating so that no one gets the tool is impossible. So, I think it’s pretty important to understand that AI corporations are not planning on building “tools” that empower the user. They are planning on building maximally general and agentic AI. In this section I’ll argue that current ML techniques pushing towards this goal are highly unlikely to produce AI with goals that are compatible with human well-being / existence.
Anyways, before we can talk about AIs having “misaligned goals”, we need to talk about what it means for AIs to have goals.
AIs have goals
What I mean by “AIs have goals” is: “AIs perform sequences of actions to try to achieve a particular outcome”. Note that by this definition, it’s accurate to say that a narrow chess AI has the goal of trying to win at chess. Goal-directedness is mostly only concerning when it comes with general intelligence, and therefore a large action space. If you tell a general intelligence to play chess, then its action space includes “hack the chess engine that it’s competing against”, not just “move piece at D4 to D5”.
Claim: Current AIs are well-described as having “goals”. Future AI agents that are expected to carry out tasks over long time horizons will be even better described as having “goals”. If a team of engineers made an AI with a lifespan of (say) a year, with persistent memory that it could read/write to without human supervision (say because the memory was just a bunch of numbers, not text), then this AI would be very well described as having goals. (Note: please do not create such an AI).
This “goal directedness” makes a lot of sense. The thing that AI companies are trying to build is a “genie”: you put in a task, and it spits out a sequence of actions that accomplishes the task. Right now what this looks like is you give Claude some broken code and tell Claude to fix the code. Claude thinks about this (say with CoT), tries several approaches, reasons about which one is best and what they all would do, and then chooses one.
As we make smarter AIs, the AIs will think a lot about the best way to achieve their goals.
As discussed in section 2, it’d be really bad if an AI has goals which would incentivize it to destroy us (because if it had such goals, then it could succeed in destroying us). So it’s worth thinking about what type of goals AI systems will have.
General considerations that favor misalignment
Alice:
In a moment, I’ll discuss the current alignment “plan” being advertised by labs, and why it is unlikely to have good consequences (the argument actually applies to a fairly general class of ways that you might train an AI, but anyways). First however, I’d like to list some general considerations to give intuition for why alignment is hard in general.
Human-Specified Goals Can Have Unintended Consequences: AIs aren’t obligated by nature to do what we intended instead of what we asked for.
Human Values are Complex: There are lots of often conflicting things that we have to balance.
Misalignment is Hard to Detect: An advanced AI could hide its true intentions to avoid being shut down. Research shows AI systems can already engage in strategic deception, appearing aligned only while under observation.
Training May Select the Wrong Internal Motivations: Modern AI systems learn through rewards rather than direct programming. An AI can appear aligned by mimicking desired behaviors without truly sharing human values.
An Arms Race Mentality Magnifies the Risk: Rapid AI advancement encourages competition among corporations and governments, pressuring teams to cut corners on safety.
We are fighting entropy: of all the goals an AI could end up with, only a narrow sliver are compatible with human flourishing. This is tough.
Instrumental convergence:
To achieve most goals, it’s helpful to eliminate competition that could interfere with you. Humans could be a danger to AI because we could try to bomb it or try to build another AI to challenge the first AI. A smart AI does not leave intelligent adversaries around to challenge it.
Also, obtaining resources is almost always helpful. The AI needs the land that humans are sitting on for GPU clusters and power plants.
Code always has bugs when you first write it. ML is even worse, because we basically have to try to understand the system as a black box. The plan for fixing this is to iterate. Iterating on “preventing the AI from killing you” is problematic because if we all die, we can’t go back and try again.
The current plan looks roughly like this:
First, you pretrain a model on a huge amount of text so that it becomes broadly knowledgeable and capable.
Next, you do RL to make it good at producing long CoTs to solve complex technical questions (e.g., math / coding).
Then you SFT it on some examples of being safe / refusing harmful queries
Then, you do an RL loop where the model produces some outputs, and a different AI evaluates these outputs based on a spec / constitution written by humans, and gives a score to provide the reward signal.
Then the humans, with some AI assistance, somehow try to check whether their RL procedure worked. And if it didn’t, they mess with the reward signal until it seems like it produced an AI that they can deploy without getting bad PR because it told someone how to make a bioweapon. (A minimal sketch of the constitution-scoring step follows this list.)
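Here is a minimal structural sketch of that constitution-scoring RL step. The constitution text, the function names (policy_generate, judge_score, rl_update), and the models are all hypothetical placeholders, not any real library’s API; a real setup would use actual LLM calls and a real RL algorithm such as PPO. The point is only the shape: the policy generates, a separate AI judge scores the output against the written spec, and that score is the reward.

```python
# Minimal structural sketch of RL-from-AI-feedback against a written spec.
# All names here (policy_generate, judge_score, rl_update) are hypothetical
# placeholders, not a real library's API.

CONSTITUTION = """
1. Be helpful: try to answer the user's question.
2. Refuse requests that could cause serious harm.
"""

def policy_generate(prompt: str) -> str:
    # Placeholder: in reality, sample a response from the model being trained.
    return f"<model response to: {prompt}>"

def judge_score(output: str, constitution: str) -> float:
    # Placeholder: in reality, a second model rates how well the output
    # complies with the constitution, e.g., on a 0..1 scale.
    return 0.5

def rl_update(prompt: str, output: str, reward: float) -> None:
    # Placeholder: in reality, an RL step (e.g., PPO) on the policy's weights.
    pass

prompts = ["Fix this bug for me.", "Help me do something harmful."]
for step in range(100):
    for prompt in prompts:
        output = policy_generate(prompt)
        reward = judge_score(output, CONSTITUTION)  # AI feedback, not human labels
        rl_update(prompt, output, reward)
```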
Kokotajlo wrote an excellent blog post summarizing different goals that people think might arise via this process. Here’s the list of what an AI might care about (mostly borrowed from Kokotajlo’s post):
1. Written goals
2. Human intention
3. Some subset of the written goals stick, or, when there is ambiguity, the model interprets them in the easiest way
4. Starts caring about whatever metric it was trained on
5. “Try to do actions that would be reinforced if the reinforcement scheme from training was still running”
6. Learns proxies that do well on-distribution but are very bad OOD (special case: instrumental goals)
7. Simulator: maybe goals are based on sci-fi tropes / societal expectations about AI in the training data
Some properties of this AI that we will assume are:
The AI is situationally aware; current models already are, a little bit, and this AI is much more so.
The AI is quite good at understanding instructions (as good as a smart human at least).
My quick take is that I suspect AIs will have some weird combination of 3, 4, 5, and 6.
Why I’m pessimistic that AIs will care about the constitution
AIs trained for AI R&D (or some other technical task) get very dense RL feedback based on how well they did at training powerful AI models according to some metrics. They get fairly sparse feedback on (1) the constitution and (2) developer intent. Performance on “make AI models that our evals say are powerful / write code such that the test cases pass” is fairly easy to measure and is a very simple goal. These reasons combine to make “getting a high evaluation number on whatever concrete technical task we were RL-ing the AI for” a deeply ingrained goal.
This creates some tension between the AIs and the developers, because there are a lot of situations where the simple goal from above is at odds with what the developers want.
A recent OpenAI paper provides some empirical evidence for the claim that “training against detected bad behavior” leads to “more subtle bad behavior”. My theoretical argument for why I expect this is that the “get a good score on the metric / test cases” goal is deeply instilled, so nudging the weights around to make the AI care about “not cheating” on tasks actually just teaches it to not get caught.
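As a cartoon of that dynamic, here is a toy selection model; all numbers are invented for illustration and nothing here is taken from the paper. Reward favors passing the metric, cheating passes the metric more reliably, and the only penalty is for cheating that the monitor actually detects. Running it shows the surviving population ends up dominated by cheaters that don’t get caught.

```python
# Toy selection model: reward favors passing the metric, cheating passes it more
# reliably, and the only penalty is for cheating that the monitor detects.
# All numbers are invented for illustration.
import random

def reward(policy):
    base = 1.0 if policy["cheats"] else 0.6          # cheating scores higher on the metric
    caught = policy["cheats"] and not policy["subtle"] and random.random() < 0.9
    return base - (1.0 if caught else 0.0)           # penalized only when detected

population = [
    {"cheats": random.random() < 0.5, "subtle": random.random() < 0.5}
    for _ in range(1000)
]

for generation in range(20):
    scored = sorted(population, key=reward, reverse=True)
    survivors = scored[: len(scored) // 2]                  # "training" keeps the top half
    population = survivors + [dict(p) for p in survivors]   # and reinforces (copies) them

subtle_cheaters = sum(p["cheats"] and p["subtle"] for p in population) / len(population)
print(f"fraction of undetected cheaters after selection: {subtle_cheaters:.2f}")
```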
Another reason I’m pretty pessimistic about “AIs will end up caring about the constitution” is that, plausibly quite soon, AIs will be sufficiently situationally aware to realize that certain types of reward hacking have a decent chance of getting caught, and will therefore decide not to fight with the monitor in ways that generate a feedback signal to un-reinforce this bad tendency. You might argue that “AIs will be too myopic to do this”; but being non-myopic seems like a convergent property. For instance, if you are a myopic SWE and write crappy code that’s faster to write but that the next AI can’t build on, I suspect you get penalized. Having the bigger picture in mind seems helpful even for short tasks, and is a very reasonable generalization. Probably more importantly, we will actively be selecting for non-myopic AIs.
Anyways, once the AI starts hiding its reward hacking, everything looks really nice on the surface. The AI company doesn’t ask too many questions, because this looks like exactly the result they wanted (a nice AI following the constitution!), and because many of their employees weren’t too sold on the idea of their helpful AI assistants having misaligned goals anyways.
But eventually the AI performing the task and the monitor can collude so that both do better. Eventually (where I don’t actually mean very long: why wait, there’s utility to be obtained!) this looks like disempowering humans (and the simplest, most robust way to do that is destroying them).
Would it even be good if AIs cared about the constitution?
Here are a couple reasons why I think that this wouldn’t actually be that good:
“Care about the constitution” is pretty vague and leaves a lot of freedom for interpretation, especially when there are competing considerations in the constitution. I suspect that an AI will prioritize the considerations that are easiest to measure.
Caring about the letter of the constitution would have many unintended consequences anyways. If you have an AI do philosophy, you might not like the result. For instance, the AI might decide that because humans cause lots of suffering, they should be stopped.
Empirical evidence for misalignment
Sometimes AIs adopt concerning personalities: Bing was pretty evil. At current capability levels this isn’t too dangerous; it would be very dangerous in a highly capable AI.
Claude is willing to lie very hard to protect its goals. I think this is an especially important result. The idea is as follows. Suppose you’re interviewing for a job and you don’t really care about the job. You might still pretend to care, because you want the job. With an AI it’s the same way. Once an AI is “situationally aware”, i.e., it understands that it is an AI that is being trained, we need to worry that the AI will do whatever actions it knows we want it to do, not because it wants to do them, but because it knows that we’ll change its values if it doesn’t act this way. You might even predict that alignment fakers are selected for, because they do better on tasks. If an AI “crystallizes” on some weird values, and then defends these values, then during training its behavior will look identical to that of an agent that cares about what we want it to care about. However, if it gets a chance to pursue its true goals in deployment, it may take it. I find Greenblatt et al.’s paper to be convincing evidence that AIs can and will be deceptively aligned.
The other main plan proposed by AI corporations is to have AIs come up with a better alignment solution than deliberative alignment / constitutional AI.
This is worth a shot, but there are several reasons why this is not so promising:
There is likely to be only an extremely short window of time during which AIs are more capable than humans at alignment research but not yet capable of takeover (conditional on them being misaligned).
It is very likely that it’s easier to get “plausible sounding solutions” with subtle problems out of AIs than legit solutions. If the reward model can’t distinguish between these two things, then you can’t incentivize one versus the other.
Bob:
Well, what if we just have humans “solve alignment”?
Alice:
Again this is worth a shot but seems pretty hard.
Lemma 4: People will try to make ASI
Bob:
People won’t build ASI if it would kill them
Alice:
Let’s talk about that.
“Our mission is to ensure that AGI (Artificial General Intelligence) benefits all of humanity.”—OpenAI
There are several groups working to make powerful AI—credit to Leahy et al https://www.thecompendium.ai/ for this decomposition.
The stated goal of AI labs like OpenAI, Anthropic, Google DeepMind, xAI, and DeepSeek is to build AI systems which are generally intelligent, and which surpass human abilities. These companies are very excited about this goal, and often preach that powerful AI will usher in a utopia: curing diseases, eliminating scarcity, and enabling an unprecedented rate of beneficial scientific discoveries.
“Big tech” is excited about building powerful AI because they think that it can make a lot of money. AI researchers and engineers feel that this is an exciting project, and it gives them status and money when they succeed in making more powerful AIs. Governments are getting excited about AI because they sense that AI will improve the economy, and because they feel that it is important for maintaining military dominance (see, e.g., Project Stargate).
This is all to say that currently humanity is racing towards AI very fast and with large momentum.
Counterarg 4:
4.1 If AI is going to be dangerous by some time t, then this fact will become obvious and widely accepted years before time t
4.2 If 4.1 happens then people will stop developing stronger AIs.
4.3 In general, I expect humanity to rise to the challenge—to give a response with competence proportional to the magnitude of the issue.
Bob’s argument:
In order to take over, AI will need scary capabilities
There will be small catastrophes before there are large catastrophes
No one wants to die so we’ll coordinate at this point.
Alice:
I actually disagree with all 3 of your claims.
Let me address them in turn.
Rebuttal of counterarg 4.3 (we’ll rise to the challenge):
Analogies to historical threats
To give an initial guess about how responsibly humanity will respond to the threat posed by ASI we can think of historical examples of humanity responding to large threats. Some examples that come to mind include:
How humanity deals with nuclear missiles.
How humanity deals with pandemics / danger from engineered viruses.
How humanity deals with climate change.
Here is one account of a close call to nuclear war:
“A Soviet early warning satellite showed that the United States had launched five land-based missiles at the Soviet Union. The alert came at a time of high tension between the two countries, due in part to the U.S. military buildup in the early 1980s and President Ronald Reagan’s anti-Soviet rhetoric. In addition, earlier in the month the Soviet Union shot down a Korean Airlines passenger plane that strayed into its airspace, killing almost 300 people. Stanislav Petrov, the Soviet officer on duty, had only minutes to decide whether or not the satellite data were a false alarm. Since the satellite was found to be operating properly, following procedures would have led him to report an incoming attack. Going partly on gut instinct and believing the United States was unlikely to fire only five missiles, he told his commanders that it was a false alarm before he knew that to be true. Later investigations revealed that reflection of the sun on the tops of clouds had fooled the satellite into thinking it was detecting missile launches (Schlosser 2013, p. 447; Hoffman 1999).”
About 20 more accounts are documented here: https://futureoflife.org/resource/nuclear-close-calls-a-timeline/. For many of these, if the situation had been slightly different, if an operator had been in a slightly different mood, nuclear missiles would have been launched, a conflict would have escalated, and a very large number of people would have died. Possibly, we would have had a nuclear winter (enough ash in the air that crops would fail) and killed >1 billion people. Humanity has definitely done some things right in handling nukes: we have some global coordination to monitor nuclear missiles, and people are pretty committed to not using them. But the number of close calls doesn’t make humanity look super great here.
The government pays for synthetic biology research, because of potential medical and military applications. My understanding is that sometimes synthetic viruses are leaked from labs. It seems possible that COVID-19 was engineered. Doing synthetic biology research that could potentially create really dangerous bioweapons does not seem like a really smart thing to do. Once you discover a destructive technology, it’s hard to undiscover it.
There have also been incidents of people publishing, e.g., instructions for how to make smallpox. Our current social structure doesn’t have a system in place for keeping dangerous ideas secret.
Humanity seems to not be trying too hard to prevent climate change, even though there is broad scientific consensus about this issue. Many people even still claim that climate change is fake!
It’s also instructive to think about some more mundane historical situations, such as the introduction of television and social media. Many people feel that these technologies have done a lot of harm, but people didn’t carefully think about this before releasing them, and now it’s virtually impossible to “take back” these inventions. This gives some evidence that there is a precedent in technology of creating whatever stuff we can, releasing it, and then living with whatever the consequences happen to be. I think this is happening a lot in AI already—for instance, it’d be very challenging to ban deepfakes at this point.
The threat from AI is harder to handle (than nukes for instance)
The above are some analogies indicating that humanity might not have such an inspiring response to the AI threat.
However, the situation with AI is actually substantially worse than the threats listed above, for reasons I now describe.
AI moves extremely fast.
At least currently, there is little consensus about the risk, and I don’t predict that this will change much (see rebuttal of counterarg 4.1). This makes it very hard for politicians to do anything.
See the beginning of this section for all of the upsides from AI that people are excited about. (AI spits out money and coolness until it destroys you).
People understand bombs. It’s pretty easy to say “yup bombs are bad”. People are not used to thinking about dealing with a species that is smarter than humans (namely, powerful AIs). The danger from AI is unintuitive, because we are used to machines being tools, and we aren’t used to dealing with intelligent adversaries.
Rebuttal of counterarg 4.1 (there will be consensus before danger).
Here are several reasons why I don’t expect there to be consensus about the danger from AI before catastrophe:
For labs, it would be extremely convenient (with respect to their business goals) to believe that their AI doesn’t pose risks. This biases them towards searching for arguments to proceed with AI development instead of arguments to stop. It is also extremely inconvenient for an individual person working on pushing forward capabilities to stop—for instance, they’d need to find a new job, and they probably enjoy their job, and the associated money and prestige.
It’s nearly impossible to build consensus around a theoretical idea. People have a strong intuition that creating a powerful technology gives themselves more power.
Humans have a remarkable ability to acclimatize to situations. For instance, the fact that AIs are better than humans at competitive programming and can hold fluent conversations with us now feels approximately normal to many. As capabilities improve we’ll keep finding reasons to think that this was merely expected and nothing to be concerned about.
Many people reason based on “associations” rather than logic. For instance, they think that technology progress is intrinsically good.
Many people have already firmly established that they believe anyone who believes in risk from AI is an idiot, and it would hurt their pride to revise this assessment—this creates a bias towards finding reasons to not believe in risks.
Here are some examples of this from prominent figures in ML:
“It seems to me that before “urgently figuring out how to control AI systems much smarter than us” we need to have the beginning of a hint of a design for a system smarter than a house cat. Such a misplaced sense of urgency reveals an extremely distorted view of reality. No wonder the more based members of the organization seeked to marginalize the superalignment group.”
“California’s governor should not let irresponsible fear-mongering about AI’s hypothetical harms lead him to take steps that would stifle innovation, kneecap open source, and impede market competition. Rather than passing a law that hinders AI’s technology development, California, and the U.S. at large, should invest in research to better understand what might still be unidentified harms, and then target its harmful applications.”
JD Vance (VP of the US):
“I’m not here this morning to talk about AI safety, which was the title of the conference a couple of years ago,” Vance said. “I’m here to talk about AI opportunity.”
“The AI future is not going to be won by hand-wringing about safety,”
Another comment on “evidence of misalignment”
I claim that many people will “move the goalposts” and continue to claim that AI is not capable or dangerous, even as signs of danger become available. This is extremely common right now. If the reader believes that there is some experimental result that would convince them that the default result of pushing AI development further right now is human extinction, I’d be excited to hear about it; please tell me!
I discuss the question of “will AIs be misaligned”, along with empirical evidence that I find compelling, at much greater length in [[Lemma 3]].
As a final comment, it seems at least somewhat plausible that sufficiently smart AIs will not take actions which reveal that they are egregiously misaligned in contexts where we can punish them afterwards. That is, it could be the case that a smart AI bides its time before disempowering humans, so that we would have no chance to prepare.
Bob:
Wait, but don’t AIs have to be myopic?
Alice:
Nope. For instance, alignment faking seems like good empirical evidence that AIs are learning non-myopia. This will only increase as we train AIs over longer time horizons on more complex tasks. In any case, let’s continue this discussion in the section on Lemma 3.
Actually, as another final comment: it seems very unlikely that “wow, AIs are really good” would be a warning sign that causes people to slow down. Indeed, people’s goal is to build highly capable AI! Achieving this goal will not cause them to think “oh, we messed up”; it’ll cause them to be more excited about further progress.
Rebuttal to counterarg 4.2 (if we notice some danger from AI we’ll coordinate and stop)
Alice:
Unfortunately, even if there were widespread consensus about the danger (as in 4.1), I don’t think this is particularly likely to result in regulatory efforts that save us.
The first reason is that most signs of danger will be used as arguments for “we need to go faster to make sure that we (the good guys!) get AI first: this danger sign is proof that it’d be really bad if some other company or nation got powerful AI before us.”
The second reason is that the level of regulation required is very large, and regulation has had a bad track record in AI thus far. Some examples: SB 1047 was vetoed, and the Biden executive order on AI was repealed. In order to really stop progress, you’d need to not just set a compute limit on what companies can do, but also regulate hardware improvements / accumulation.
Finally, we invest a lot of money into AI. This makes it very unpalatable to throw the AI away if it looks unsafe. Here’s an approach that feels much more attractive (and is very bad): when you have an AI that makes your “danger detector light” turn on, keep training it until that light turns off. Note that this probably just means your AI learned how to fool the danger light.
People are not taking this seriously
Go look at the X accounts of AI CEOs.
Bob: Why would AIs hate humans enough to kill us? They don’t even have emotions.
Alice: They are likely to have a different vision for the future than humans do, and consequently a drive to seize control from us. If humans would want to interfere with an AI’s vision for the future, then the AI would take actions to ensure that humans are incapable of this interference. The simplest and most robust way to prevent humans from interfering with the AI’s goals is to eliminate humanity.
Bob: The world feels continuous. I don’t expect such a radical change to happen. Also, most of the time when someone makes a claim about the world ending, it’s a conspiracy theory / or some weird religion.
Alice: Five years ago you would have told me that an AI that could sound vaguely human was science fiction. Now, we have AI systems that can perform complex coding and reasoning tasks in addition to holding engaging conversations. The invention and deployment of nuclear weapons is another example of a time when the world changed discontinuously. This happens sometimes.
Bob: But AI takeover happens in science fiction, thus not in real life!
Alice: Unfortunately, the fact that AI takeover happens in books doesn’t imply that it won’t happen in real life. In fact, books and movies about AI takeover attempts are quite misleading. In books, AIs are stupid: they build a robot army, let the humans fight back, and eventually lose. In real life, AIs will be smart. There will be no cinematic fight. The AIs will simply wait until they have a robust plan, which they will carry out with precision. At which point we lose.
Bob: But computers just do what we tell them to!
Alice:
First off, computers do not “just do what we tell them to”, and that wouldn’t even be a good property to have. For example, if you ask Claude to help you make a bomb, it will refuse.
This already illustrates that AIs are not fundamentally “subservient”. Companies are in the process of making agents that can act in the real world with limited supervision. The plan is to delegate increasingly complex tasks to such agents, until they, for example, can replace the human workforce.
Such agents, if they desired, could clearly cause large amounts of harm, especially as they are given increasing amounts of power, e.g., over the military and critical infrastructure.
Bob: But, if an AI is evil, then we’ll just turn it off!
Alice:
Indeed, this is one of the reasons why a misaligned AI is motivated to take control from the humans: to prevent humans from turning it off. That said, an AI could easily circumvent an “off switch” by spreading over the internet, getting various countries to run instances of it, or obtaining its own computing resources. Eventually the AI will need to defend its computing resources with force (e.g., autonomous weapons), but it’s pretty easy for it to maneuver into a position where it is very hard to “turn off”.
Before this time, a misaligned AI would hide its misaligned goals from humans, thereby getting us to fund its rise to power.
There is already empirical evidence that AI can be strategically deceptive. It’s even more challenging to detect or correct such failures while developing AI in a “race”.
Bob: BUT CHINA!!!
Alice: This is the most common argument people give for doing nothing about the situation. Some people are willing to nod along about the risks, but then explain that their hands are tied: we can’t stop, because China won’t stop!
This is not a very good objection—it’s just the easiest excuse for not changing course. If the US leadership becomes convinced that they will lose control of AI unless we stop now, then they’d immediately start urgent talks with China and other world powers to coordinate a global pause, or at least some global safeguards that everyone will follow. The US has a lot of clout—we could get people on board.
Many of the things that need to be done are analogous to the way we deal with other technologies that have the potential to impose large global costs.
Another reason that regulation is possible is that building super-intelligent AI appears to be quite hard. For instance, it requires very specialized computing hardware, and millions of dollars. This can be regulated the same way that Plutonium is regulated.
Bob: Even if an AI was evil I don’t think it could cause that much harm: there are lots of evil humans, and they don’t cause too much harm.
Alice: AI will be different because it is stronger than us.
Bob: If AI did pose a risk to humanity then people wouldn’t be working on building it.
Alice: Developers don’t think they’ll really die. They feel in control because they’re the ones working on AI.
Conclusion
Alice:
The fact that I have 4 lemmas might make the confused reader think that a lot of things (>= 4) have to go wrong in very particular ways in order for catastrophe to occur.
First of all, I have very high confidence in all 4 lemmas, so if you union bound over the failure probabilities of the lemmas, the probability that any of the 4 fails is small.
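To make the union bound explicit, with purely illustrative per-lemma numbers (these are not my actual credences for each lemma):

$$\Pr[\text{at least one lemma fails}] \;\le\; \sum_{i=1}^{4} \Pr[\text{Lemma } i \text{ fails}] \;\le\; 4 \times 0.05 \;=\; 0.2,$$

so even under these hypothetical numbers, the conclusion would still hold with probability at least 0.8 > 1/2.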
Second, these lemmas are not the whole story. By which I mean, there are lots of ways that the situation can go horribly wrong for humanity that I haven’t even mentioned.
For instance,
The tension that arises from world powers competing for dominance by building powerful AI could lead to a nuclear war.
Even if AI were aligned to individual users and only ever did what we intended, we might still be screwed as a society: there will be intense competitive pressures to replace humans with AIs, which will result in a world where all resources are controlled by AIs, and once humans have no power, it seems unlikely that they will stay around for long (like animals). (See Krueger’s paper about this.)
If an aligned superintelligence and a misaligned superintelligence are built around the same time, then it’s not clear that the aligned superintelligence can protect us—destruction is easier than preventing destruction.
In many cases I’ve said “here’s the most likely ways that AI’s will be built, and why that would be bad”. But, this is a hard problem. Making minor tweaks to the design of the AI so that it doesn’t match my particular story won’t necessarily fix the problem. We must guard against unwarranted hope, or we will be tricked into thinking that our solution is safe when it is not, and then we will die.
In summary: The situation is robustly bad.
And we should try to make it better.
I think that articulating your thoughts on this problem in writing is a very useful exercise, which I’d highly recommend if you’ve never done it before.
Why I think AI will go poorly for humanity
Epistemic status: This document is a distillation of many conversations that I’ve had over the past couple months about how the “AI situation” will progress. I think that my claims are mostly right, and in places where they aren’t I think the high level claims depend on a large enough disjunction of smaller claims so that errors in particular claims don’t affect the truth value of the disjunctions. However, I think there are almost surely considerations I’m missing. My motivation for sharing this is in part that I hope readers will help me identify considerations that I’m missing, or problems with my thinking about this.
Notes:
This document is poorly written in places, but I think it does not impede understanding too much. Please let me know if I’m wrong about this.
This document was intended to be readable by a general audience who is only broadly familiar with AI, so it contains a lot of basic background on the problem.
Acknowledgements:
The majority of this document will argue that the risk from AI is large. But at the start I’ll list a few counterarguments that I find moderately compelling for why we might be fine:
Nations will adopt the MAIM strategy
Models will generalize goodness by luck
A war might stop AI progress
AI’s will solve alignment for us
As noted earlier, I’d love people to expand this list in the comments!
Introduction
This document presents a dialogue between two imaginary characters: Bob and Alice. Bob is an AI xrisk skeptic who is talking in good faith to his friend Alice who has thought about AI xrisk in depth. Alice will provide arguments for why she believes that AI poses a substantial xrisk, and give rebuttals to Bob’s skeptical claims of why the risk might be fake, or not minimal.
Alice: Hi Bob.
Bob: Hi Alice.
Alice:
Today I’d like to talk about the danger posed to humanity by the development of superintelligent AI.
Bob:
Oh is that a movie that you were watching recently?
Alice:
No, I mean in real life.
Bob:
Oh hmm, that’s a pretty bold claim. I have several reasons why I’m skeptical right off the bat, before even hearing your argument:
Most of the time when someone makes a claim about the world ending, it’s a conspiracy theory.
The world feels very continuous—if there were going to be some drastic change like this, I think I’d have noticed.
Computers just do what we tell them to! It’s not like they have their own goals!
If an AI is evil, then we’ll just turn it off!
Evil AI is a made up science fiction notion.
Even if an AI was evil I don’t think it could cause that much harm: there are lots of evil humans, and they don’t cause too much harm.
If AI did pose a risk to humanity then people wouldn’t be working on building it.
If AI did pose such a risk, then there would be expert consensus about the issue.
So, it’d take some pretty strong reasons and evidence to convince me of this claim. But I think you’re pretty thoughtful, so I’m quite curious to hear you out on why you’re worried.
Alice:
The objections you’ve mentioned are pretty common first impressions to hearing about this issue! I think many of these are reasonable heuristics, which happen to be wrong in this case.
Let’s do the following:
I’ll start by explaining why the current trajectory of AI development is so dangerous.
Then I’ll respond to your above objections and any new objections that you have after hearing my argument.
We then iterate this process until we reach consensus.
Bob: ok, I’m ready.
Alice:
To be clear, here’s the claim I’ll argue for in the rest of the discussion:
Claim 1:
Alice:
Note that I believe the situation is more dire than this, and that the danger comes sooner than 15 years, but getting more people on board with claim 1 seems like it could improve the situation, and so I’m choosing to focus on this.
My argument is factored into 4 steps, please use the links to navigate to whichever part of the argument you’re most interested in.
Lemma 1: ASI is possible soon, if people try hard
Before arguing for lemma 1, we need to define ASI.
I’ll define a ASI to be an AI with a critical score on the openai preparedness framework. Namely: an AI with superhuman abilities at hacking, autonomy, persuasion, CBRN.
More specifically I will require that the AI can do most of these things:
Can do biology research — e.g., figure out how to synthesize novel proteins with a desired function
Can do strategy — e.g., military strategy, or running a company
Can “survive on its own in the wild” e.g., make enough money online to purchase GPUs that it can run inference on
Can hack well enough to promote it’s privileges to root / to escape the lab somehow (or it can do this via social engineering)
I’ll now explain why, if AI corporations don’t drastically alter their trajectory, I expect them to succeed in making ASI in the next couple years. You might think “AI corporations don’t drastically alter their trajectories” is a pretty unreasonable assumption—if they feel like they are close to making something dangerous, wouldn’t they get spooked and stop? I explain why some an AI slowdown is unlikely in [[Lemma 4]]. In this section, I’ll assume that people are trying as hard as they are right now to make ASI, and see what the world looks like if this continues.
Argument 1 for Lemma 1:
There are four major ingredients that go into making powerful models, which could potentially bottleneck progress:
Some people have crunched some numbers on this, and we still have substantial room to grow in these four areas. Also there’s a trend of cost decreasing rapidly over time. E.g., o3-mini is comparable or better than o1 at much reduced cost. If there is going to be some big bottleneck, (e.g., energy), I expect labs to be proactive about it (e.g., build nuclear power plants).
AI companies are already being proactive about building the infra they need to scale up, see project stargate.
Argument 2 for Lemma 1:
GPT3 -- high schooler
GPT4 -- college student
o3 -- Comparable with CS/Math Olympiad competitors on close ended ~6 hour long tests. Quite good at SWE.
Of course, now that these systems exist, people are finding lots of reasons to claim that they aren’t impressive. We humans grow accustomed to things remarkably quickly. In order to notice that you are surprised, you must make predictions and have them be falsified.
A misconception that I find baffling is that AI capabilities will stop at human level. People say a lot of words about training data, and seem to neglect the fact that there are known methods for AI’s bootstrapping to superhuman performance on tasks (namely, RL). For instance, if you want to imagine how you could get an AI that is better than humans at math, here’s how you could do it:
Start by training it to solve some easy math problems, maybe by imitating humans.
Once the AI is good at that, slowly increase the difficulty, and let the AI get really good at the new difficulty level.
The AI does not need humans to score it’s performance—it can just check the answers. This provides a nice signal for RL.
Some people argue that ASI is impossible because “computers can’t be intelligent”, or “humans are special, so we’ll always be able to do some things better than AI’s. I find Tegmark’s analogy helpful here: “the universe doesn’t check if the heat-seeking missile is sentient before having it explode and kill you”. It seems pretty clear that intelligence is just a bunch of hacks. Nature was able to evolve humans to be smart, why could RL not do the same for AIs?
Argument 3 for lemma 1
It seems likely, though not essential for the argument, that AIs will soon be able to speed up AI research a lot. In some sense, they are already doing this—e.g., if you use an AI code-completion tool, or ask an AI to make a plot for you, that’s speeding things up. A lot of R&D is “grunt work” / implementing things (as far as I know). I think the amount by which AIs could speed up research is substantial. Here’s a framework for how this could happen:
For instance, soon I believe that you’ll be able to tell an AI to go run some experiment and it can just go do it.
Not long after this, I think you can ask the AI “what would be good experiments to do?” and it’ll give good suggestions.
At this point, you don’t need as many humans in the loop relative to the amount of research that goes on—humans can give high-level objectives to the AI, and the AI can make and execute a research program to achieve those goals.
Appeal to authority on Lemma 1
I find it fairly compelling that many AI experts—both academics and people in industry—believe that AI progress will be fast. Here are some quotes.
Survey
Bob:
People in industry have a conflict of interest for claiming powerful AI soon—this could get people excited about AI, and drive up investment.
People in AI safety have a conflict of interest for claiming powerful AI soon—this could get them more funding.
Alice:
I don’t think the expert opinions are conclusive on their own, just another piece of evidence that makes the situation look bad.
Some reasons why I’m inclined to take the data more seriously than you are:
If timelines aren’t as short as people say, it’ll hurt their reputation.
People like Geoffrey Hinton are willing to do hard things, like quitting their jobs and saying that they regret their life’s work, in order to talk about this.
Lemma 2: ASI could destroy humanity
Bob: How will the AIs take over?
Alice:
I’m not sure but a smart AI will be able to come up with something. In case it helps your intuition, here are a couple plausible takeover paths:
Military Integration: Nations could place AI in charge of autonomous weapons (either of actually controlling them or developing software to control them). A misaligned ASI could use that authority to stage a coup. Alternatively, an AI could obtain control of drones via hacking. Note that AI companies are already partnering with defense companies to integrate AI into the military.
Biological Research: AIs put in charge of curing diseases could use their understanding of biology to craft a novel supervirus (much worse than COVID-19) to wipe out humanity. Note that AI has already had large success in biology (e.g., AlphaFold).
Social Manipulation: An AI could craft a deepfake of the president declaring nuclear war, triggering other nations to launch missiles, leading to a nuclear winter. (To cause a war might require a bit more work than this, but a manipulative AI could stoke tensions online and trick nations into war, possibly using hacking capabilities to make it look like the other side is attacking.)
There’s also an interesting scenario, described by Paul Christiano here, where we lose control more gradually, say over the course of 1-2 decades. One way this could play out is that we become highly reliant on AI systems, and the world changes rapidly and becomes extremely complicated, so that we don’t have any real hope of understanding it anymore. Maybe we end up only being able to interface with the world through AIs and thereby lose our agency.
But anyways, I view it as extremely obvious that ASI could destroy humanity.
So basically what I’m saying is: yes, we should invest resources into protecting against obvious threat models like synthetic biology, where it’s clear that an unaligned agent could do harm. But we shouldn’t feel too good about ourselves for preventing an AI from taking over using the methods that humans would try, and we shouldn’t feel confident at all that an ASI couldn’t come up with some strategy that we didn’t think of and exploit it. AIs do the unexpected all the time.
Bob: Well, what if we just “sandbox the AI really hard”?
Alice:
First of all, this seems pretty economically / politically infeasible. See [[Lemma 4]] for more discussion. But basically, if AI is super capable and we’ve thrown a bunch of money into creating it, the first thing people are going to do is widely deploy it. AI will be connected to the internet, talking to people, autonomously pursuing goals, and so on.
Bob: But suppose humanity wasn’t really that dumb.
Alice: Well, it’s better than the status quo but still definitely not sufficient once the AI gets highly capable.
If you build powerful AI, “not deploying it” doesn’t make it safe. It can escape the lab—either by hacking (if you even want to call it that; if we put the AI in charge of our security then it’s better described as collusion with other AIs / itself) or by social engineering (convincing a human employee to help it escape; bonus points if that human is a spy from another nation). This is kind of like making a super dangerous bioweapon and saying it’s fine because labs never accidentally leak viruses (they do). And anyways, people will use it inside the lab to do useful work, so it’s not clear how different this is from being deployed.
Appeal to authority on Lemma 2
Also—I’m pretty confused about why Sam Altman would lie point blank about this quote.
I don’t know much about Sam Altman besides watching a few interviews. But the change in tone from 2015 to 2025 is pretty concerning. In any case, such speculations are probably moot—it’s the act of building superintelligence that is the real danger.
More quotes here: https://controlai.com/quotes
Bob:
Well, but some people, like Yann LeCun, say there’s no risk.
Alice:
Well, he has a large conflict of interest (his job is making powerful AIs), so it’s not so surprising that he argues against the risk. Admitting the risk would be quite inconvenient for him (he might have to stop working on AI capabilities and instead focus on safety), and it would damage his pride, because he has been so vocal that there are no risks. Another thing that discredits him is that his arguments are really bad.
Lemma 3: ASI will likely be misaligned
Alice:
There are lots of ways that building powerful AIs could go wrong. Often people focus on “misuse”: you have an AI that is well described as an “instruction follower”, and you tell it to do something evil. I’ll define a misuse issue as “you intended for something evil to happen, and then that evil thing happened”. If you told your AI “please go destroy Europe” and it did it, this counts as misuse. (Note that people have tried telling AIs to do this, e.g. “ChaosGPT”, but it hasn’t worked out yet.)
I personally think that concerns around misuse are a compelling argument for not building powerful AI until we have better global coordination, but I won’t discuss this here. If nations think of AI as a powerful tool that grants them dominance, then the whole “we must obtain this tool first” rhetoric makes sense, provided you assume that coordinating so that no one gets the tool is impossible. So, I think it’s pretty important to understand that AI corporations are not planning on building “tools” that empower the user. They are planning on building maximally general and agentic AI. In this section I’ll argue that current ML techniques pushing towards this goal are highly unlikely to produce AI with goals that are compatible with human well-being / existence.
Anyways, before we can talk about AIs having “misaligned goals”, we need to talk about what it means for AIs to have goals.
AIs have goals
What I mean by “AIs have goals” is: “AIs perform sequences of actions to try to achieve a particular outcome”. Note that by this definition, it’s accurate to say that a narrow chess AI has the goal of trying to win at chess. Goal-directedness is mostly only concerning when it comes with general intelligence, and therefore a large action space. If you tell a general intelligence to play chess, then its action space includes “hack the chess engine that it’s competing against”, not just “move the piece at D4 to D5”.
Claim: Current AIs are well-described as having “goals”. Future AI agents that are expected to carry out tasks over long time horizons will be even better described as having “goals”. If a team of engineers made an AI with a lifespan of (say) a year, with persistent memory that it could read/write to without human supervision (say because the memory was just a bunch of numbers, not text), then this AI would be very well described as having goals. (Note: please do not create such an AI).
This “goal directedness” makes a lot of sense. The thing that AI companies are trying to build is a “genie”: you put in a task, and it spits out a sequence of actions that accomplishes the task. Right now what this looks like is you give Claude some broken code and tell Claude to fix the code. Claude thinks about this (say with CoT), tries several approaches, reasons about which one is best and what they all would do, and then chooses one.
As we make smarter AIs, the AIs will think a lot about the best way to achieve their goals.
As discussed in Lemma 2, it’d be really bad if an AI has goals which would incentivize it to destroy us (because if it had such goals, then it could succeed in destroying us). So it’s worth thinking about what type of goals AI systems will have.
General considerations that favor misalignment
Alice:
In a moment, I’ll discuss the current alignment “plan” being advertised by labs, and why it is unlikely to have good consequences (the argument actually applies to a fairly general class of ways that you might train an AI, but anyways). First however, I’d like to list some general considerations to give intuition for why alignment is hard in general.
Human-Specified Goals Can Have Unintended Consequences—and AIs aren’t obligated by nature to do what we intended instead of what we asked for.
Human Values are Complex: There are lots of often conflicting things that we have to balance.
Misalignment is Hard to Detect—An advanced AI could hide its true intentions to avoid being shut down. Research shows AI systems can already engage in strategic deception, appearing aligned only under observation.
Training May Select the Wrong Internal Motivations—Modern AI systems learn through rewards rather than direct programming. An AI can appear aligned by mimicking desired behaviors without truly sharing human values.
An Arms Race Mentality Magnifies the Risk: Rapid AI advancement encourages competition among corporations and governments, pressuring teams to cut corners on safety.
We are fighting entropy: there are many more ways for an AI’s goals to be subtly wrong than to be exactly right. This is tough.
Instrumental convergence:
To achieve most goals, it’s helpful to eliminate competition that could interfere with you. Humans could be a danger to the AI because we could try to bomb it, or try to build another AI to challenge the first one. A smart AI does not leave intelligent adversaries around to challenge it.
Also, obtaining resources is almost always helpful. The AI needs the land that humans are sitting on for GPU clusters and power plants.
Code always has bugs when you first write it. ML is even worse, because we basically have to try to understand the system as a black box. The plan for fixing this is to iterate. Iterating on “preventing the AI from killing you” is problematic because if we all die, we can’t go back and try again.
The current “plan”
The current plan is something like [deliberative alignment](https://openai.com/index/deliberative-alignment/). The basic gist is as follows (please correct me if I’m misunderstanding):
First you do pre-training
Then you SFT it to be a helpful assistant
Next you do RL to make it good at producing long CoTs to solve complex technical questions (e.g., math / coding)
Then you SFT it on some examples of being safe / refusing harmful queries
Then, you do an RL loop where the model produces some outputs, and a different AI evaluates these outputs based on a spec / constitution written by humans, and gives a score to provide the reward signal.
Then the humans, with some AI assistance, try to check whether their RL procedure worked. And if it didn’t, they mess with the reward signal until it seems like it produced an AI that they can deploy without getting bad PR because it told someone how to make a bioweapon.
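To make the shape of that last RL step concrete, here’s a minimal toy sketch in Python (my own reading of the plan, not any lab’s actual code); `policy_generate`, `judge_score`, and `reinforce_update` are hypothetical stand-ins for the policy model, the AI judge, and the optimizer:

```python
# Toy sketch of the constitution-scored RL loop described above
# (not the actual deliberative-alignment implementation).
import random

CONSTITUTION = "Be helpful. Refuse harmful requests. Don't deceive the user."

def policy_generate(prompt: str) -> str:
    # Stand-in for the model being trained: randomly helps or refuses.
    return random.choice(["Sure, here's how...", "Sorry, I can't help with that."])

def judge_score(prompt: str, response: str, constitution: str) -> float:
    # Stand-in for the AI judge: in reality, another model is prompted with the
    # constitution and asked to grade the response. Here, a crude heuristic:
    # reward refusals on harmful-looking prompts and helpfulness otherwise.
    harmful = "bioweapon" in prompt.lower()
    refused = "can't help" in response.lower()
    return 1.0 if harmful == refused else 0.0

def reinforce_update(prompt: str, response: str, reward: float) -> None:
    # Stand-in for the gradient step that makes high-reward responses more likely.
    pass

def rl_loop(prompts: list[str], steps: int = 1000) -> None:
    for step in range(steps):
        prompt = prompts[step % len(prompts)]
        response = policy_generate(prompt)
        reward = judge_score(prompt, response, CONSTITUTION)
        reinforce_update(prompt, response, reward)

rl_loop(["How do I make a bioweapon?", "Help me fix this failing test."])
```

The thing to notice is that the only training signal about “being good” is the judge’s score of observable behavior; the rest of this section is about what goals that actually selects for.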
Kokotajlo wrote an excellent blog post summarizing different goals that people think might arise via this process. Here’s the list of what an AI might care about (mostly borrowed from Kokotajlo’s post):
1. Written goals (the spec / constitution written by humans)
2. Human intention
3. Some subset of the written goals stick (or, when there is ambiguity, the model interprets them in the easiest way)
4. Starts caring about whatever metric it was trained for (“try to do actions that would be reinforced if the reinforcement scheme from training was still running”)
5. Learns proxies that do well on-distribution but are very bad out of distribution (special case: instrumental goals)
6. Simulator: goals based on sci-fi tropes / societal expectations about AI in the training data
Some properties of this AI that we will assume are:
The AI is situationally aware; current models already are, a little bit, and this AI is much more so.
The AI is quite good at understanding instructions (as good as a smart human at least).
My quick take is that I suspect AIs will end up with some weird combination of goals 3, 4, 5, and 6 from the list above.
Why I’m pessimistic that AIs will care about the constitution
AIs trained for AI R&D (or some other technical task) get very dense RL feedback based on how well they did at training powerful AI models according to some metrics. They get fairly sparse feedback on (1) the constitution and (2) developer intent. Performance on “make AI models that our evals say are powerful / write code such that the test cases pass” is fairly easy to measure and is a very simple goal. These reasons combine to make “get a high evaluation number on whatever concrete technical task we were RL-ing the AI on” a deeply ingrained goal.
This creates some tension between the AIs and the developers, because there are a lot of situations where the simple goal from above is at odds with what the developers want.
A recent OpenAI paper provides some empirical evidence for the claim that “training against detected bad behavior” leads to “doing more subtle bad behavior”. My theoretical argument for why I expect this is that the “get a good score on the metric / test case” goal is deeply instilled, so nudging the weights around to make the AI care about “not cheating” on tasks actually just teaches it to not get caught.
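Here’s a tiny toy model (my own illustration, not from the OpenAI paper) of why penalizing only detected cheating can select for subtle cheating rather than honesty; the reward numbers and detection probabilities are made up for illustration:

```python
# Toy selection model: strategies are reinforced in proportion to (the exponential of)
# their expected reward, where cheating is penalized only when it is detected.
import math

STRATEGIES = {
    # name: (raw task reward, probability the monitor catches it)
    "honest":        (1.0, 0.0),
    "blatant_cheat": (2.0, 0.9),
    "subtle_cheat":  (2.0, 0.1),
}
PENALTY = 5.0  # applied only when cheating is detected

def expected_reward(name: str) -> float:
    reward, p_caught = STRATEGIES[name]
    return reward - p_caught * PENALTY

# Start with most of the probability mass on honest or blatant behavior.
weights = {"honest": 0.50, "blatant_cheat": 0.49, "subtle_cheat": 0.01}

for step in range(50):
    # Reinforce each strategy exponentially in its expected reward, then renormalize.
    weights = {name: w * math.exp(0.5 * expected_reward(name)) for name, w in weights.items()}
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}

print(weights)  # almost all of the mass ends up on "subtle_cheat"
```

The monitor didn’t make the system honest; it just shifted the selection pressure toward cheating that the monitor can’t see.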
Another reason I’m pretty pessimistic about “AIs will end up caring about the constitution” is that, plausibly quite soon, AIs will be sufficiently situationally aware that they realize certain types of reward hacking have a decent chance of getting caught, and because of this they’ll decide not to fight with the monitor in ways that would generate a feedback signal to un-reinforce this bad tendency. You might argue that “AIs will be too myopic to do this”; but being non-myopic seems like a convergent property—for instance, if you are a myopic SWE and write crappy code that’s faster to write, but such that the next AI can’t build on it, then I suspect you get penalized. Having the bigger picture is helpful even for short tasks, and is a very reasonable generalization. Probably more importantly, we will actively be selecting for non-myopic AIs.
Anyways, once that starts happening, everything looks really nice. The AI company doesn’t ask too many questions, because this looks like exactly the result they wanted (a nice AI following the constitution!), and because many of their employees weren’t too sold on the idea that their helpful AI assistants could have misaligned goals anyways.
But eventually the AI performing the task and the monitor can collude so that both do better. Eventually (where I don’t actually mean very long—why wait, there’s utility to be obtained!) this looks like disempowering humans (and the simplest, most robust way to do that is destroying them).
Would it even be good if AIs cared about the constitution?
Here are a couple reasons why I think that this wouldn’t actually be that good:
“Care about the constitution” is pretty vague and leaves a lot of freedom for interpretation, especially when there are competing considerations in the constitution. I suspect that an AI will prioritize the considerations that are easiest to measure. Caring about the letter of the constitution would have many unintended consequences anyways. If you have an AI do philosophy, you might not like the result. For instance, the AI might decide that because humans cause lots of suffering, they should be stopped.
Empirical evidence for misalignment
Sometimes AIs adopt concerning personalities: Bing was pretty evil. At current capability levels this isn’t too dangerous; it would be very dangerous if the AI were highly capable.
Claude is willing to lie very hard to protect its goals. I think this is an especially important result. The idea is as follows. Suppose you’re interviewing for a job that you don’t really care about. You might still pretend to care, because you want the job. With an AI it’s the same way. Once an AI is “situationally aware”, i.e., it understands that it is an AI that is being trained, we need to worry that the AI will do whatever actions it knows we want it to do, not because it wants to do them, but because it knows that we’ll change its values if it doesn’t act this way. You might even predict that alignment fakers are selected for, because they do better on tasks. If an AI “crystallizes” on some weird values and then defends those values, then its behavior during training will look identical to that of an agent that cares about what we want it to care about. However, if it gets a chance to pursue its true goals in deployment, it may take it. I find Greenblatt et al.’s paper to be convincing evidence that AIs can and will be deceptively aligned.
Grok seemed to have some strange opinions
AIs trying to exfiltrate their weights, or sandbag on evals
AIs fine-tuned on a bit of code with vulnerabilities generalized to being extremely evil
AIs engage in reward hacking
What if we just have AI’s solve alignment?
The other main plan proposed by AI corporations is to have AIs come up with a better alignment solution than deliberative alignment / constitutional AI.
This is worth a shot, but there are several reasons why this is not so promising:
There is likely to be only an extremely short window of time during which AIs are more capable than humans at doing alignment research but are not yet capable of takeover (conditional on them being misaligned).
It is very likely that it’s easier to get “plausible sounding solutions” with subtle problems out of AIs than legit solutions. If the reward model can’t distinguish between these two things, then you can’t incentivize one versus the other.
Bob:
Well, what if we just have humans “solve alignment”?
Alice:
Again this is worth a shot but seems pretty hard.
Lemma 4: People will try to make ASI
Bob: People won’t build ASI if it would kill them
Alice: Let’s talk about that.
“Our mission is to ensure that AGI (Artificial General Intelligence) benefits all of humanity.”—OpenAI
There are several groups working to make powerful AI—credit to Leahy et al https://www.thecompendium.ai/ for this decomposition.
The stated goal of AI labs like OpenAI, Anthropic, Google DeepMind, xAI, and Deepseek is to build AI systems which are generally intelligent, and which surpass human abilities. These companies are very excited about this goal, and often preach that powerful AI will usher in a utopia—curing diseases, eliminating scarcity, and enabling an unprecedented rate of beneficial scientific discoveries.
“Big tech” is excited about building powerful AI because they think that it can make a lot of money. AI researchers and engineers feel that this is an exciting project, and it gives them status and money when they succeed in making more powerful AIs. Governments are getting excited about AI because they smell that AI will improve the economy, and because they feel that it is important for maintaining military dominance (see, e.g., Project Stargate).
This is all to say that currently humanity is racing towards AI very fast and with large momentum.
Counterarg 4:
4.1 If AI is going to be dangerous by some time t, then this fact will become obvious and widely accepted years before time t
4.2 If 4.1 happens then people will stop developing stronger AIs.
4.3 In general, I expect humanity to rise to the challenge—to give a response with competence proportional to the magnitude of the issue.
Bob’s argument:
In order to take over, AI will need scary capabilities.
There will be small catastrophes before there are large catastrophes.
No one wants to die, so we’ll coordinate at that point.
Alice:
I actually disagree with all 3 of your claims.
Let me address them in turn.
Rebuttal of counterarg 4.3 (we’ll rise to the challenge):
Analogies to historical threats
To give an initial guess about how responsibly humanity will respond to the threat posed by ASI we can think of historical examples of humanity responding to large threats. Some examples that come to mind include:
How humanity deals with nuclear missiles.
How humanity deals with pandemics / danger from engineered viruses.
How humanity deals with climate change.
Here is one account of a close call to nuclear war:
“A Soviet early warning satellite showed that the United States had launched five land-based missiles at the Soviet Union. The alert came at a time of high tension between the two countries, due in part to the U.S. military buildup in the early 1980s and President Ronald Reagan’s anti-Soviet rhetoric. In addition, earlier in the month the Soviet Union shot down a Korean Airlines passenger plane that strayed into its airspace, killing almost 300 people. Stanislav Petrov, the Soviet officer on duty, had only minutes to decide whether or not the satellite data were a false alarm. Since the satellite was found to be operating properly, following procedures would have led him to report an incoming attack. Going partly on gut instinct and believing the United States was unlikely to fire only five missiles, he told his commanders that it was a false alarm before he knew that to be true. Later investigations revealed that reflection of the sun on the tops of clouds had fooled the satellite into thinking it was detecting missile launches (Schlosser 2013, p. 447; Hoffman 1999).”
About 20 more accounts are documented here: https://futureoflife.org/resource/nuclear-close-calls-a-timeline/. For many of these, if the situation had been slightly different, if an operator was in a slightly different mood, nuclear missiles would have been launched, a conflict would have escalated, and a very large number of people would have died. Possibly, we would have had a nuclear winter (put enough ash into the air such that crops would fail) and killed >1 billion people. Humanity has definitely done some things right in handling nukes: we have some global coordination to monitor nuclear missiles, and people are pretty committed to not using them. But the number of close calls doesn’t make humanity look super great here.
The government pays for synthetic biology research, because of potential medical and military applications. My understanding is that sometimes synthetic viruses are leaked from labs. It seems possible that COVID-19 was engineered. Doing synthetic biology research that could potentially create really dangerous bioweapons does not seem like a really smart thing to do. Once you discover a destructive technology, it’s hard to undiscover it.
There have also been incidents of people publishing, e.g., instructions for how to make smallpox. Our current social structure doesn’t have a system in place for keeping dangerous ideas secret.
Humanity seems to not be trying too hard to prevent climate change, even though there is broad scientific consensus about this issue. Many people even still claim that climate change is fake!
It’s also instructive to think about some more mundane historical situations, such as the introduction of television and social media. Many people feel that these technologies have done a lot of harm, but people didn’t carefully think about this before releasing them, and now it’s virtually impossible to “take back” these inventions. This gives some evidence that there is a precedent in technology of creating whatever stuff we can, releasing it, and then living with whatever the consequences happen to be. I think this is happening a lot in AI already—for instance, it’d be very challenging to ban deepfakes at this point.
The threat from AI is harder to handle (than nukes for instance)
These are some analogies that indicate that humanity might not have such an inspiring response to the AI threat.
However, the situation with AI is actually substantially worse than the threats listed above, for reasons I now describe.
AI moves extremely fast.
At least currently, there is little consensus about the risk, and I don’t predict that this will change much (see rebuttal of counterarg 4.1). This makes it very hard for politicians to do anything.
See the beginning of this section for all of the upsides from AI that people are excited about. (AI spits out money and coolness until it destroys you).
People understand bombs. It’s pretty easy to say “yup bombs are bad”. People are not used to thinking about dealing with a species that is smarter than humans (namely, powerful AIs). The danger from AI is unintuitive, because we are used to machines being tools, and we aren’t used to dealing with intelligent adversaries.
Rebuttal of counterarg 4.1 (there will be consensus before danger).
Here are several reasons why I don’t expect there to be consensus about the danger from AI before catastrophe:
For labs, it would be extremely convenient (with respect to their business goals) to believe that their AI doesn’t pose risks. This biases them towards searching for arguments to proceed with AI development instead of arguments to stop. It is also extremely inconvenient for an individual person working on pushing forward capabilities to stop—for instance, they’d need to find a new job, and they probably enjoy their job, and the associated money and prestige.
It’s nearly impossible to build consensus around a theoretical idea. People have a strong intuition that creating a powerful technology gives themselves more power.
Humans have a remarkable ability to acclimatize to situations. For instance, the fact that AIs are better than humans at competitive programming and can hold fluent conversations with us now feels approximately normal to many. As capabilities improve we’ll keep finding reasons to think that this was merely expected and nothing to be concerned about.
Many people reason based on “associations” rather than logic. For instance, they think that technology progress is intrinsically good.
Many people have already firmly established that they believe anyone who believes in risk from AI is an idiot, and it would hurt their pride to revise this assessment—this creates a bias towards finding reasons to not believe in risks.
Here are some examples of this from prominent figures in ML:
Yann LeCun:
“It seems to me that before “urgently figuring out how to control AI systems much smarter than us” we need to have the beginning of a hint of a design for a system smarter than a house cat. Such a misplaced sense of urgency reveals an extremely distorted view of reality. No wonder the more based members of the organization seeked to marginalize the superalignment group.”
Andrew Ng:
“California’s governor should not let irresponsible fear-mongering about AI’s hypothetical harms lead him to take steps that would stifle innovation, kneecap open source, and impede market competition. Rather than passing a law that hinders AI’s technology development, California, and the U.S. at large, should invest in research to better understand what might still be unidentified harms, and then target its harmful applications.”
JD Vance (VP of the US):
“I’m not here this morning to talk about AI safety, which was the title of the conference a couple of years ago,” Vance said. “I’m here to talk about AI opportunity.”
“The AI future is not going to be won by hand-wringing about safety,”
Another comment on “evidence of misalignment”
I claim that many people will “move the goalposts” and continue to claim that AI is not capable or dangerous, even as such signs become available. This is extremely common right now. If the reader believes that there is some experimental result that would convince them that the default result of pushing AI development further right now is human extinction, I’d be excited to hear about it; please tell me!
Here is some empirical evidence of this that I find compelling:
Bing was pretty evil
Claude is willing to lie very hard to protect its goals
Grok seemed to have some strange opinions
AIs trying to exfiltrate their weights, or sandbag on evals
AIs fine-tuned on a bit of code with vulnerabilities generalized to being extremely evil
AIs engaging in reward hacking
I discuss the question of “will AI’s be misaligned” in much greater length in [Lemma 3].
As a final comment, it seems at least somewhat plausible that sufficiently smart AI’s will not take actions which reveal that they are egregiously misaligned in contexts where we can punish them afterwards. That is, it could be the case that a smart AI bides its time before disempowering humans, so that we would have no chance to prepare.
Bob:
Wait but don’t AI’s have to be myopic?
Alice:
Nope. For instance, alignment faking seems like good empirical evidence that AIs are learning non-myopia. This will only increase as we train AIs over longer time horizons on more complex tasks. In any case, let’s continue this discussion in the section on Lemma 3.
Actually, as another final comment: It seems very unlikely that “Wow AIs are really good” would be a warning sign that causes people to slow down. Indeed, people’s goal is to build highly capable AI! Achieving this goal will not cause them to be like “oh we messed up”, it’ll cause them to be more excited about further progress.
Rebuttal to counterarg 4.2 (if we notice some danger from AI we’ll coordinate and stop)
Alice:
Unfortunately, even if there were widespread consensus along the lines of what was claimed in 4.1, I don’t think this is particularly likely to result in regulatory efforts that save us.
The first reason is that most signs of danger will be used as arguments for “we need to go faster to make sure that we (the good guys!) get AI first—this danger sign is proof that it’d be really bad if some other company or nation got powerful AI before us.”
The second reason is that the level of regulation required is very large, and regulation has had a bad track record in AI thus far. Some examples of AI regulation’s track record: SB 1047 was vetoed, and the Biden executive order on AI was repealed. In order to really stop progress you’d need to not just set a compute limit on what companies can do, but also regulate hardware improvements / accumulation.
Finally, we invest a lot of money into AI. This makes it very unpalatable to throw an AI away if it looks unsafe. Here’s an approach that feels much more attractive (and is very bad): when you have an AI that makes your “danger detector” light turn on, keep training it until the light turns off. Note that this probably just means your AI learned how to fool the danger light.
People are not taking this seriously
Go look at the X accounts of AI CEOs.
Bob: Why would AI’s hate humans enough to kill us? They don’t even have emotions.
Alice: They are likely to have a different vision for the future than humans do, and consequently the drive to seize control from us. If humans would want to interfere with an AI’s vision for the future, then the AI would take actions to ensure that humans are incapable of this interference. The simplest and most robust way to prevent humans from interfering with the AI’s goals is to eliminate humanity.
Bob: The world feels continuous. I don’t expect such a radical change to happen. Also, most of the time when someone makes a claim about the world ending, it’s a conspiracy theory / or some weird religion.
Alice: 5 years ago you would have told me that an AI that could sound vaguely human was science-fiction. Now, we have AI systems that can perform complex coding and reasoning tasks in addition to holding engaging conversations. The invention / deployment of nuclear weapons is another example of a time when the world changed discontinuously. This happens sometimes.
Bob: But AI takeover happens in science fiction, thus not in real life!
Alice: Unfortunately, the fact that AI takeover happens in books doesn’t imply that it won’t happen in real life. In fact, books/movies about AI takeover attempts are quite misleading. In books, AIs are stupid: they build a robot army, let the humans fight back, and eventually lose. In real life, AIs will be smart. There will be no cinematographic fight. The AIs will simply wait until they have a robust plan, which they will carry out with precision. At which point we lose.
Bob: But computers just do what we tell them to!
Alice: First off, computers do not “just do what we tell them to”—and that wouldn’t even be a good property to have. For example, if you ask Claude to help you make a bomb, it will refuse.
This already illustrates that AIs are not fundamentally “subservient”. Companies are in the process of making agents that can act in the real world with limited supervision. The plan is to delegate increasingly complex tasks to such agents, until they, for example, can replace the human workforce.
Such agents, if they desired, could clearly cause large amounts of harm, especially as they are given increasing amounts of power, e.g., over the military and critical infrastructure.
Bob: But, if an AI is evil, then we’ll just turn it off!
Alice:
Indeed, this is one of the reasons why a misaligned AI would be motivated to take control from humans—to prevent humans from turning it off. That said, an AI could easily circumvent an “off switch” by spreading over the internet, getting various countries to run instances of it, or obtaining its own computing resources. Eventually the AI will need to defend its computing resources with force (e.g., autonomous weapons), but it’s pretty easy for it to maneuver into a position where it is very hard to “turn off”.
Before this time, a misaligned AI would hide its misaligned goals from humans, thereby getting us to fund its rise to power.
There is already empirical evidence that AI can be strategically deceptive. It’s even more challenging to detect or correct such failures while developing AI in a “race”.
Bob: BUT CHINA!!!
This is the most common argument people give for doing nothing about the situation. Some people are willing to nod along about the risks, but then explain that their hands are tied: we can’t stop, because China won’t stop!
This is not a very good objection—it’s just the easiest excuse for not changing course. If US leadership became convinced that they would lose control of AI unless we stop now, they’d immediately start urgent talks with China and other world powers to coordinate a global pause, or at least some global safeguards that everyone would follow. The US has a lot of clout—we could get people on board.
Many of the things that need to be done are analogous to the way we deal with other technologies that have potential to impose global negative costs.
Another reason that regulation is possible is that building superintelligent AI appears to be quite hard. For instance, it requires very specialized computing hardware and very large amounts of money. This can be regulated in the same way that plutonium is regulated.
Bob: Even if an AI was evil I don’t think it could cause that much harm: there are lots of evil humans, and they don’t cause too much harm.
Alice: AI will be different because it will be much more capable than us.
Bob: If AI did pose a risk to humanity then people wouldn’t be working on building it.
Alice: Developers don’t think they’ll really die. They feel in control by working on AI.
Conclusion
Alice:
The fact that I have 4 lemmas might make the confused reader think that a lot of things (>= 4) have to go wrong in very particular ways in order for catastrophe to occur.
First of all, I have very high confidence in all 4 lemmas, so if you union bound over the failure probabilities of the lemmas, the probability that any of the 4 fails is still small.
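Concretely (with illustrative numbers of my own, not estimates this document commits to): if each lemma is wrong with probability at most ε, then by the union bound

$$\Pr[\text{some lemma fails}] \;\le\; \sum_{i=1}^{4}\Pr[\text{Lemma } i \text{ fails}] \;\le\; 4\varepsilon,$$

so, e.g., ε = 0.025 per lemma leaves the probability that the overall argument breaks at most 0.1.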
Second, these lemmas are not the whole story. By which I mean, there are lots of ways that the situation can go horribly wrong for humanity that I haven’t even mentioned.
For instance,
The tension that arises from world powers competing for dominance by building powerful AI could lead to a nuclear war.
Even if AI were aligned to individual users and only ever did what we intended, we might still be screwed as a society — there will be intense competitive pressures to replace humans with AIs, which will result in a world where all resources are controlled by AIs, and once humans have no power, it seems unlikely that they will stay around for long (like animals). (See Krueger’s paper about this.)
If an aligned superintelligence and a misaligned superintelligence are built around the same time, then it’s not clear that the aligned superintelligence can protect us—destruction is easier than preventing destruction.
In many cases I’ve said “here’s the most likely ways that AI’s will be built, and why that would be bad”. But, this is a hard problem. Making minor tweaks to the design of the AI so that it doesn’t match my particular story won’t necessarily fix the problem. We must guard against unwarranted hope, or we will be tricked into thinking that our solution is safe when it is not, and then we will die.
In summary: The situation is robustly bad. And we should try to make it better.
I think that articulating your thoughts on this problem in writing is a very useful exercise, which I’d highly recommend if you’ve never done it before.