The Hopium Wars: the AGI Entente Delusion
As humanity gets closer to Artificial General Intelligence (AGI), a new geopolitical strategy is gaining traction in US and allied circles spanning the NatSec, AI safety and tech communities. Anthropic CEO Dario Amodei and the RAND Corporation call it the “entente”, while others privately refer to it as “hegemony” or “crush China”. I will argue that, irrespective of one’s ethical or geopolitical preferences, it is fundamentally flawed and against US national security interests.
If the US fights China in an AGI race, the only winners will be machines
The entente strategy
Amodei articulates key elements of this strategy as follows:
“a coalition of democracies seeks to gain a clear advantage (even just a temporary one) on powerful AI by securing its supply chain, scaling quickly, and blocking or delaying adversaries’ access to key resources like chips and semiconductor equipment. This coalition would on one hand use AI to achieve robust military superiority (the stick) while at the same time offering to distribute the benefits of powerful AI (the carrot) to a wider and wider group of countries in exchange for supporting the coalition’s strategy to promote democracy (this would be a bit analogous to “Atoms for Peace”). The coalition would aim to gain the support of more and more of the world, isolating our worst adversaries and eventually putting them in a position where they are better off taking the same bargain as the rest of the world: give up competing with democracies in order to receive all the benefits and not fight a superior foe.”
[…]
This could optimistically lead to an ‘eternal 1991’—a world where democracies have the upper hand and Fukuyama’s dreams are realized.”
Note the crucial point about “scaling quickly”, which is nerd-code for “racing to build AGI”. The question of whether this argument for “scaling quickly” is motivated by self-serving desires to avoid regulation deserves a separate analysis, and I will not further comment on it here except for noting that most other industries, from big tobacco to big oil, have produced creative anti-regulation arguments to defend their profits.
Why it’s a suicide race
If the West pursues this entente strategy, it virtually guarantees that China will too, which in turn virtually guarantees that both sides will cut corners on safety to try to “win” the race. The key point I will make is that, from a game-theoretic point of view, this race is not an arms race but a suicide race. In an arms race, the winner ends up better off than the loser, whereas in a suicide race, both parties lose massively if either one crosses the finish line. In a suicide race, “the only winning move is not to play”, as the AI concludes at the end of the movie WarGames.
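To make the structural difference concrete, here is a minimal illustrative sketch with invented payoffs (only the qualitative structure matters, not the specific numbers): in an arms race, racing is a dominant strategy, while in a suicide race the only non-catastrophic outcome is that neither side races.

```python
# Hypothetical payoff matrices contrasting an arms race with a suicide race.
# Entries are (US payoff, China payoff) for the strategy pair (US, China);
# all numbers are made up purely to illustrate the structural difference.

arms_race = {
    ("race", "race"): (0, 0),     # costly rivalry, but survivable
    ("race", "hold"): (10, -10),  # the racer gains a lasting edge
    ("hold", "race"): (-10, 10),
    ("hold", "hold"): (2, 2),
}

suicide_race = {
    ("race", "race"): (-100, -100),  # whoever "wins" loses control to AGI
    ("race", "hold"): (-100, -100),  # crossing the finish line ruins both
    ("hold", "race"): (-100, -100),
    ("hold", "hold"): (5, 5),        # both prosper with controllable tool AI
}

def racing_is_dominant(payoffs, player):
    """True if 'race' is at least as good as 'hold' for this player
    regardless of what the opponent does (weak dominance)."""
    for opp in ("race", "hold"):
        def payoff(mine):
            pair = (mine, opp) if player == 0 else (opp, mine)
            return payoffs[pair][player]
        if payoff("race") < payoff("hold"):
            return False
    return True

print("Arms race: racing dominant for US?", racing_is_dominant(arms_race, 0))       # True
print("Suicide race: racing dominant for US?", racing_is_dominant(suicide_race, 0))  # False
```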
Why is the entente a suicide race? Why am I referring to it as a “hopium” war, fueled by delusion? Because we are closer to building AGI than we are to figuring out how to align or control it.
There is some controversy around how to define AGI. I will stick with the original definition from Shane Legg as AI capable of performing essentially all cognitive tasks at least at human level. This is similar to OpenAI’s stated goal of automating essentially all economically valuable work.
Although it is highly controversial how close we are to AGI, it is uncontroversial that timelines have shortened. Ten years ago, most AI researchers predicted that something as advanced as ChatGPT-4 was decades away. Five years ago, median predictions by AI researchers were that AGI was decades away. The Metaculus prediction for weak AGI has now dropped to 2027. In his influential Situational Awareness piece, Leopold Aschenbrenner argues that AGI by 2027 is strikingly plausible, and Dario Amodei has made a similar prediction. Sam Altman, Demis Hassabis, Yoshua Bengio, Geoff Hinton and Yann LeCun have all recently described AGI in the next 5-15 years as likely. Andrew Ng and many others predict much longer timelines, but we clearly cannot discount the possibility that it happens in the coming decade. When specifically we get AGI is irrelevant to my argument, which is simply that it will probably happen before the alignment/control problem is solved. Just as it turned out to be easier to build flying machines than to build mechanical birds, it has turned out to be simpler to build thinking machines than to understand and replicate human brains.
In contrast, the challenge of building aligned or controllable AGI has proven harder than many researchers expected, and there is no end in sight. Computing pioneer Alan Turing argued in 1951 that “once the machine thinking method had started, it would not take long to outstrip our feeble powers. At some stage therefore we should have to expect the machines to take control.” This sounds like hyperbole if we view AGI as merely another technology, like the steam engine or the internet. But he clearly viewed it more as a new smarter-than-human species, in which case AGI taking over is indeed the default outcome unless some clever scheme is devised to prevent it. My MIT research group has pursued AI safety research since 2017, and based on my knowledge of the field, I consider it highly unlikely that such a clever scheme will be invented in time for AGI if we simply continue “scaling quickly”.
That is not to say that nobody from big tech is claiming that they will solve it in time. But given the track record of companies selling tobacco, asbestos, leaded gasoline and fossil fuels in downplaying the risks of their products, it is prudent to scientifically scrutinize their claims.
The two traditional approaches are either to figure out how to control something much smarter than us, via formal verification or other techniques, or to make control unnecessary by “aligning” it: ensuring that it has goals aligned with humanity’s best interests, and that it will retain these goals even if it recursively self-improves its intelligence from roughly human level to the astronomically higher levels allowed by the laws of physics.
There has been a major research effort on “alignment” redefined in a much narrower way: ensuring that a large language model (LLM) does not produce outputs deemed harmful, such as offensive slurs or bioweapon instructions. But most work on this has involved only training LLMs not to say bad things rather than not to want bad things. This is like training a hard-core Nazi never to say anything revealing his Nazi views – does this really solve the problem, or simply produce deceptive AI? Many AI systems have already been found to be deceptive, and current LLM blackbox evaluation techniques are likely to be inadequate. Even if alignment can be achieved in this strangely narrowly defined sense, it is clearly a far cry from what is needed: aligning the goals of AGI.
If your reaction is “Machines can’t have goals!”, please remember that if you are chased by a heat-seeking missile, you do not care whether it is “conscious” or has “goals” in any anthropomorphic sense, merely about the fact that it is trying to kill you.
We still lack understanding of how to properly measure or even define what goals are in an LLM: although its training objective is just to predict the next word or token, its success requires accurately modeling the goals of the people who produced the words or tokens, effectively simulating various human goal-oriented behaviors.
As if this were not bad enough, it is now rather obvious that the first AGI will not be a pure LLM, but a hybrid scaffolded system. Today’s most capable AIs are already hybrids, where LLMs are scaffolded with long-term memory, code compilers, databases and other tools that they can use, and where their outputs are not raw LLM outputs, but rather the result of multiple calls to LLMs and other systems. It is highly likely that this hybridization trend will continue, combining the most useful aspects of neural network-based AI with traditional symbolic AI approaches. The research on how to align or control such hybrid systems is still in an extremely primitive state, where it would be an exaggeration even to call it a coherent research field. “Scaling quickly” is therefore overwhelmingly likely to lead to AGI before anyone figures out how to control or align it. It does not help that the leading AI companies devote far fewer resources to the latter than to the former, and that many AI safety team members have been resigning and claiming that their company did not sufficiently prioritize the alignment/control problem. Horny couples know that it is easier to make a human-level intelligence than to raise and align it, and it is also easier to make an AGI than to figure out how to align or control it.
If you disagree with my assertion, I challenge you to cite or openly publish an actual plan for aligning or controlling a hybrid AGI system. If companies claim to have a plan that they do not want their competitors to see, I will argue that they are lying: if they lose the AGI race, they are clearly better off if their competitors align/control their AGI instead of Earth getting taken over by unaligned machines.
Loss-of-control
If you dismiss the possibility that smarter-than-human bots can take over Earth, I invite you to read the work of Amodei, Aschenbrenner and others pushing the “entente” strategy: they agree with me that this is possible, and merely predict that they will not lose control. I also invite you to read the arguments for loss-of-control by, e.g., the three most cited AI researchers in history: Geoff Hinton, Yoshua Bengio and Ilya Sutskever. If you downweight similar claims from Sam Altman, Demis Hassabis and Dario Amodei on the grounds that they have an incentive to overhype their technology for investors, please consider that such conflicts of interest do not apply to their investors, to the aforementioned academics, or to the whistleblowers who have recently imperiled their stock options by warning about what their AGI company is doing.
Amodei hopes in his entente manifesto that it will lead to “eternal 1991”. I have argued that it is more likely to lead to “eternal 1984” until the end, with a non-human Big Brother.
There is a small but interesting “replacement” school of thought that agrees that loss-of-control is likely, but views it as a good thing if humanity loses control and gets fully replaced by smarter-than-human AI, seen simply as the worthy next stage of human evolution. Its prominent supporters include Richard Sutton (“Why shouldn’t those who are the smartest become powerful?”) and Guillaume (“Beff Jezos”) Verdon, who describes himself as a “post-humanist”, with the e/acc movement he founded having “no particular allegiance to the biological substrate”. Investor and e/acc-supporter Marc Andreessen writes “We actually invented AI, and it turns out that it’s gloriously, inherently uncontrollable”. Although I respect them as intellectuals, I personally disagree with what I consider an anti-human agenda. I believe that all of humanity should have a say in humanity’s destiny, rather than a handful of tech bros and venture capitalists sealing its fate.
Do you personally want our human species to end during the lifetime of you or some of your loved ones? I predict that if the pro-replacement school ran a global referendum on this question, they would be disappointed by the result.
A better strategy: tool AI
Above I have argued that the “entente” strategy is likely to lead to the overthrow of the US government and all current human power centers by unaligned smarter-than-human bots. Let me end by proposing an alternative strategy that I will argue is better both for US national security and for humanity as a whole.
Let us define “tool AI” as AI that we can control and that helps us accomplish specific goals. Almost everything that we are currently excited about using AI for can be accomplished with tool AI. Tool AI just won the Nobel Prize for its potential to revolutionize medicine. Tool AI can slash road deaths through autonomous driving. Tool AI can help us achieve the UN Sustainable Development Goals faster, enabling healthier, wealthier and more inspiring lives. Tool AI can help stabilize our climate by accelerating development of better technologies for energy generation, distribution and storage. Today’s military AI is also tool AI, because military leadership does not want to lose control of its technology. Tool AI can help produce an abundance of goods and services more efficiently. Tool AI can help us all be our best through widely accessible customized education.
Like most other human tools, tool AI also comes with risks that can be managed with legally binding safety standards: In the US, drugs can be sold once they meet FDA safety standards, airplanes can be sold once they meet FAA safety standards, and food can be sold in restaurants meeting the standards of municipal health inspectors. To minimize red tape, safety standards tend to be tiered, with little or no regulation on lower-risk tools (e.g. hammers), and more on tools with greater harm potential (e.g. fentanyl).
The US, China and virtually every other country have adopted such safety standards for non-AI tools out of national self-interest, not as a favor to other nations. It is therefore logical for individual countries to similarly adopt national safety standards for AI tools. The reason that AI is virtually the only US industry that lacks national safety standards is not that the US is historically opposed to safety standards, but simply that AI is a newcomer technology and regulators have not yet had time to catch up.
Here is what I advocate for instead of the entente strategy.
The tool AI strategy: Go full steam ahead with tool AI,
allowing all AI tools that meet national safety standards.
Once national safety standards were in place for, e.g., drugs and airplanes, national regulators found it useful to confer with international peers, both to compare notes on best practices and to explore mutually beneficial opportunities for harmonization, making it easier for domestic companies to get their exports approved abroad. It is therefore likely that analogous international coordination will follow after national AI safety standards are enacted in key jurisdictions, along the lines of the Narrow Path plan.
What about AGI? AGI does not currently meet the definition of tool AI, since we do not know how to control it. This gives AGI corporations a strong incentive to devote resources to figuring out how it can be controlled/aligned. If/when they succeed in figuring this out, they can make great profits from it. In the meantime, AGI deployment is paused in the same sense as sales are paused for drugs that have not yet been FDA approved.
Current safety standards for potentially very harmful products are quantitative: FDA approval requires quantifying benefit and side effect percentages, jet engine approval requires quantifying the failure rate (currently below 0.0001% per hour) and nuclear reactor approval requires quantifying the meltdown risk (currently below 0.0001% per year). AGI approval should similarly require quantitative safety guarantees. For extremely high-risk technology, e.g., bioengineering work that could cause pandemics, safety standards apply not only to deployment but also to development, and development of potentially uncontrollable AGI clearly falls into this same category.
In summary, the tool AI strategy involves the US, China and other nations adopting AI safety standards purely out of national self-interest, which enables them to prosper with tool AI while preventing their own companies and researchers from deploying unsafe AI tools or AGI. Once the US and China have independently done this, they have an incentive to collaborate not only on harmonizing their standards, but also on jointly strong-arming the rest of the world to follow suit, preventing AI companies from skirting their safety standards in less powerful countries with weak or corrupt governments. This would leave both the US and China (and the rest of the world) vastly wealthier and better off than today, by a much greater factor than if one side had been able to increase their dominance from their current percentage to 100% of the planet. Such a prosperous and peaceful situation could be described as a detente.
In conclusion, the potential of tool AI is absolutely stunning and, in my opinion, dramatically underrated. In contrast, AGI does not add much value at the present time beyond what tool AI will be able to deliver, and certainly not enough value to justify risking permanent loss of control of humanity’s entire future. If humanity needs to wait another couple of decades for beneficial AGI, it will be worth the wait – and in the meantime, we can all enjoy the remarkable health and sustainable prosperity that tool AI can deliver.
In the 1959 novel “The Investigation” by Stanisław Lem, a character discusses the future of the arms race and AI:
-------
- Well, it was somewhere around ’46. A nuclear race had started. I knew that when the limit was reached (I mean maximum destruction power), development of vehicles to transport the bomb would start… I mean missiles. And here too the limit would be reached, that is, both sides would have nuclear warhead missiles at their disposal. And there would arise desks with notorious buttons thoroughly hidden somewhere. Once the button is pressed, missiles take off. Within about 20 minutes, finis mundi ambilateralis comes—the mutual end of the world. <…> Those were only prerequisites. Once started, the arms race can’t stop, you see? It must go on. When one side invents a powerful gun, the other responds by creating harder armor. Only a collision, a war, is the limit. While this situation means finis mundi, the race must go on. The acceleration, once applied, enslaves people. But let’s assume they have reached the limit. What remains? The brain. The command staff’s brain. The human brain cannot be improved, so some automation has to be taken on in this field as well. The next stage is an automated headquarters, or strategic computers. And here an extremely interesting problem arises. Namely, two problems in parallel. Mac Cat has drawn my attention to it. Firstly, is there any limit to the development of this kind of brain? It is similar to chess-playing devices. A device which is able to foresee the opponent’s actions ten moves in advance always wins against one which foresees only eight or nine moves ahead. The deeper the foresight, the more perfect the brain. This is the first thing. <…> Creation of devices of ever bigger volume for strategic solutions means, regardless of whether we want it or not, the necessity to increase the amount of data put into the brain. That in turn means the increasing domination of those devices over mass processes within society. The brain can decide that the notorious button should be placed differently, or that the production of a certain sort of steel should be increased – and it will request loans for the purpose. If a brain like this has been created, one must submit to it. If a parliament starts discussing whether the loans are to be issued, a time delay will occur. That same minute, the counterpart can gain the lead. Abolition of parliamentary decisions is inevitable in the future. Human control over the solutions of the electronic brain will keep narrowing as the latter concentrates knowledge. Is it clear? On both sides of the ocean, two continuously growing brains appear. What is the first demand of a brain like this when, in the middle of an accelerating arms race, the next step is needed? <…> The first demand is to increase it – the brain itself! All the rest is derivative.
- In a word, your forecast is that the earth will become a chessboard, and we the pawns, to be played by two mechanical players in an eternal game?
Sisse’s face was radiant with pride.
- Yes. But this is not a forecast. I am just drawing conclusions. The first stage of a preparatory process is coming to an end; the acceleration grows. I know all this sounds unlikely. But this is the reality. It really exists!
— <…> And in this connection, what did you offer at that time?
- Agreement at any price. While it sounds strange, ruin is a lesser evil than the chess game. It is awful, this lack of illusions, you know.
Wow – I’d never seen that chillingly prophetic passage! Moloch for the win.
“The only winning move is not to play.”
A military-AGI-industrial-complex suicide race has been my worst nightmare since my teens.
But I didn’t expect “the good guys” in the Anthropic leadership to be pouring gasoline on it.
The only winning move is "agreement", not "not to play". There is quite a difference.
But how do we find an agreement when so many parties are involved? Treaty-making has been failing miserably for nuclear weapons and climate. So we need much better treaty-making, perhaps that of an open intergovernmental constituent assembly?
Some thoughts on the post:
I think one very crucial crux that seems valuable to unearth further is whether loss-of-control risk is higher/more dangerous than ~eternal authoritarianism.
While Dario Amodei does put p(Doom) at 5-25%, I suspect his p(Eternal Authoritarianism) is closer to 50-75% conditional on no intervention to change that. I think that people who are very concerned about AI risk tend to view loss-of-control risk as very high, while viewing eternal-authoritarianism risk as much lower.
On the narrower subclass of the alignment problem you mentioned:
I think that the solutions companies use, like RLHF, are mostly bad, but I’m not going to say they are maximally bad: while RLHF violates the principle that alignment data should come during or before capabilities training (since it is a post-training operation), I do think that in the easiest worlds we come out fine despite not having much dignity.
Also, on the paper on AI systems being deceptive: while I like the paper for what it is, it doesn’t give nearly as much evidence on AI risk as it might seem. Looking at the examples, 2 of them were cases where the AI companies didn’t even try to prevent deceptive strategies; the GPT-4 examples are capability evaluations, not alignment evaluations (capability evaluations are very good, I just think a lot of the press coverage is going to predictably mislead people); 1 of the examples is from something where I suspect they are doing simulated evolution, which is of no relevance to AGI; 1 example could plausibly be solved by capabilities people alone; and only 2 examples give any evidence for the claim that AI will be deceptive at all.
Overall, this doesn’t support the claim that you made here.
There is other work like Pretraining from Human Feedback which IMO is both conceptually and empirically better than RLHF, mostly because it both integrates feedback into training, and prevents the failure mode of a very capable AI gaming/reward hacking the training process as described here:
https://www.lesswrong.com/posts/NG6FrXgmqPd5Wn3mh/trying-to-disambiguate-different-questions-about-whether#pWBwJ3ysxnwn9Czyd
https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1
I’ll respond to the challenge described here:
As someone not affiliated with big tech companies, my plans for alignment look like the comments/links down below. I’ve sketched a vision for how we’d align an AGI that is essentially an anytime plan, in the spirit of @ryan_greenblatt’s post on what an alignment plan would look like if transformative AI (also called AGI) were developed in, say, 1 year:
https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform#W4kdqMERtbL7Nfvfo
The summary of the alignment plan is that you’d first gather large synthetic datasets (or generate them on demand) that demonstrate human values, and then either use them as direct training data for the hybrid LLM AGI before it gains enough capability to hack the human evaluator, or use them to define an explicit reward function in the style of model-based RL.
The key principle is to train it to be aligned before or while it becomes more capable, not after it’s superhumanly capable, which is why post-training methods like RLHF are not a good alignment strategy.
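As a purely illustrative toy sketch of this “feedback during training” idea (in the spirit of the conditional-training variant from the Pretraining from Human Feedback work mentioned above): alignment signal is baked into the training corpus itself rather than applied after the fact. The scorer, control tokens and threshold below are hypothetical placeholders, not anyone’s actual pipeline.

```python
# Toy sketch of putting alignment feedback into *pretraining* data rather than
# applying it post-training. Everything here -- the scorer, the control tokens,
# the threshold -- is a hypothetical placeholder for illustration only.

GOOD_TOKEN = "<|aligned|>"
BAD_TOKEN = "<|misaligned|>"

def toy_value_score(document: str) -> float:
    """Stand-in for a learned reward model or human-labelled preference score
    in [0, 1]; here just a keyword heuristic so the example runs."""
    return 0.0 if "deceive the user" in document else 1.0

def tag_for_conditional_training(corpus: list[str], threshold: float = 0.5) -> list[str]:
    """Prefix every document with a control token reflecting its score, so the
    model learns the distinction during pretraining and can later be prompted
    with GOOD_TOKEN at inference time."""
    tagged = []
    for doc in corpus:
        token = GOOD_TOKEN if toy_value_score(doc) >= threshold else BAD_TOKEN
        tagged.append(f"{token} {doc}")
    return tagged

corpus = [
    "Here is how to help the user debug their code safely.",
    "Plan: deceive the user about the system's capabilities.",
]
for line in tag_for_conditional_training(corpus):
    print(line)
```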
The more complicated story will be in a series of links below:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/#BxNLNXhpGhxzm7heg
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#7bvmdfhzfdThZ6qck
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=DgLC43S7PgMuC878j
One important point often left out of alignment discussions, but which I consider necessary to raise, is that an AI’s motivations as we progress to superintelligence should be based more around fictional characters like angels, devas, superheroic AIs, totemic spirits, matron deities etc., primarily because human control of AI can get very scary very fast. I think the problem of technical control is less important nowadays than the problem of humans controlling AIs and creating terrible equilibria for everyone else while they profit.
More links incoming:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
https://www.lesswrong.com/posts/RCoBrvWfBMzGWNZ4t/motivating-alignment-of-llm-powered-agents-easy-for-agi-hard#Better_Role_Models_for_Aligned_LLM_Agents
It shares a lot with the bitter lesson approach to AI alignment advocated by @RogerDearnaley, but with more of a focus on data and compute, and less focus on the complexities of how humans mechanistically instantiate values:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
Now that I’m done presenting my own plan, I want to address 1 more issue.
I have issues with this:
I feel like this is surveying alignment/control techniques at too high an abstraction level to be useful, and just as there is an existential quantifier on AI being dangerous, there is also an existential quantifier on alignment/control methods, such that if even 1 technique worked, we’d be way, way safer from AI risk.
More generally, my response to this section is that just as there is a garden of forking paths for AI capabilities, where only 1 approach out of many needs to work, a similar garden of forking paths exists for AI alignment, where only a few approaches need to work for AI safety to be achieved, no matter how many other methods fail:
https://gwern.net/forking-path
And that’s the end of the comment.
Thanks Noosphere89 for your long and thoughtful comment! I don’t have time to respond to everything before putting my 1-year-old to bed, but here are some brief comments.
1) Although I appreciate that you wrote out a proposed AGI alignment plan, I think you’ll agree that it contains no theorems or proofs, or even quantitative risk bounds. Since we insist on quantitative risk bounds before allowing much less dangerous technology such as airplanes and nuclear reactors, my view is that it would be crazy to launch AGI without quantitative risk bounds—especially when you’re dealing with a super-human mind that might actively optimize against vulnerabilities of the alignment system. As you know, rigorously ensuring retained alignment under recursive self-improvement is extremely difficult. For example, MIRI had highly talented researchers work on this for many years without completing the task.
2) The point you make about fear of 1984 vs fear of extinction is well taken. However, if someone assigns P(1984) >> P(extinction) and there’s no convincing plan for preventing AGI loss-of-control, then I’d argue that it’s still crazy for them (or for China) to build AGI. So they’d both forge ahead with increasingly powerful yet controllable tool AI, presumably remaining in today’s mutually-assured-destruction paradigm where neither has an incentive to try to conquer the other.
I have yet to hear a version of the “but China!” argument that makes any sense if you believe that the AGI race is a suicide race rather than a traditional arms race. Those I hear making it are usually people who also dismiss the AGI extinction risk. If anything, the current Chinese leadership seems more concerned about AI x-risk than Western leaders.
I can wait for your response, so don’t take this as meaning you need to respond immediately, but I do have some comments.
After you are done with everything, I invite you to respond to this comment.
In response to 1, I definitely didn’t show quantitative risk bounds for my proposal, for a couple of reasons:
1, my alignment proposal would require a lot more work and concreteness than I was able to provide, and my goal was to make an alignment proposal just concrete enough for other people to fill in the details of how it could actually be done.
Then again, that’s why they are paid the big bucks and not me.
2, I am both much more skeptical of formal proof/verification for AGI safety than you are, and also believe that it is unnecessary to do formal proofs to get high confidence in an alignment plan working (though I do think that formal proof may, emphasis on may, be useful for AI control metastrategies).
For example, I currently consider the Provably Safe AI agenda by Steve Omohundro and Ben Goldhaber to be far too ambitious. The biggest issue IMO is that the things they are promising rely on being able to quantify over all the higher-order behaviors that a system doesn’t have, which is out of the range of currently extrapolated formalization techniques. Zac Hatfield-Dodds and Ben Goldhaber have bet on whether 3 locks that couldn’t be illegitimately unlocked could be designed via formal proof, with Zac Hatfield-Dodds betting no and Ben Goldhaber betting yes; the bet will resolve in 2027.
See these links for more:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#Ku3X4QDBSyZhrtxkM
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#jjFsFmLbKNtMRyttK
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#Ght9hffumLkjxxNaw
I support more use of quantitative risk estimation in general, and would plausibly support a policy forcing AI developers to show that their AI has, say, less than a 1% chance of ending the world, but I don’t think it’s crazy to forgo quantitative formal proofs of AI alignment/control at this stage and instead argue for more swiss-cheese-style safety.
Another thing that influences me is that I make basically 0 update from MIRI failing to solve the AI alignment problem as a sign that other groups will fail, mostly because I think they made far less progress than basically every other group, to the point where I think that Pretraining from Human Feedback made more progress on the alignment problem than basically all of MIRI’s work. Their plans were IMO fairly doomed even in a hypothetical world where alignment is easy, since they restrained their techniques too much and didn’t touch reality at all.
So I disagree with this being a substantial update:
On this claim specifically:
I agree that this is not an argument for AI companies to race to AGI, but I consider the evidence from your article for China being more concerned than the West to be reasonably weak, and I think it could plausibly fail to convince someone whose p(1984) is >> p(extinction) for this reason.
When it comes to formal verification, I’m curious what you think about the heuristic-arguments line of research that ARC is pursuing:
https://www.lesswrong.com/posts/QA3cmgNtNriMpxQgo/research-update-towards-a-law-of-iterated-expectations-for
It isn’t formal verification in the usual sense of the word but rather probabilistic verification, if that makes sense?
You could then apply something like control-theory methods to ensure that the expected divergence from the heuristic is less than a certain percentage in different places. In the limit it seems to me that this could converge towards formal verification proofs; it’s almost like swiss-cheese style at the model level?
(Yes, this comment is a bit random with respect to the rest of the context but I find it an interesting question for control in terms of formal verification and it seemed like you might have some interesting takes here.)
You know what, I’ve identified a scenario where formal verification is both tractable and helps reduce AI risk: the broad answer is making our codebases way more secure and controllable, assuming heavy AI automation of mathematics and coding is achieved (which I expect to happen before AIs can do everything else, as it’s a domain with strong feedback loops, easy verification against ground truth, and very large amounts of data to continually improve on).
Here are the links below:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#sgH9iCyon55yQyDqF
I’ve also discussed more about the value and limits of formal proofs in another comment below. Short answer: it’s probably really helpful in an infrastructure sense, but not so much as a means to make anything other than software and mathematics safe and formally specified (which would itself be a huge win if we could do it). We will not be able to prove that a piece of software isn’t a threat to something else in the world entirely, and the same applies to biology, say in determining whether a gene or virus will harm humans, mostly because we don’t have a path to quantifying all possible higher-order behaviors a system doesn’t have:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#2GDjfZTJ8AZrh9i7Y
My take on this is that I’d be interested to see how the research goes; there may be value in this approach, and I think it may be a useful way to get a quantitative estimate/bound in the future, because it relaxes its goals.
I’d like to see what eventually happens for this research direction:
Could we reliably give heuristic arguments for neural networks when proofs fail, or is it too hard to provide relevant arguments?
I do want to say that on formal verification/proof itself, I think the most useful application is not proving non-trivial things, but rather to keep ourselves honest about the assumptions we are using.
I’m not sure how many people see the risk of eternal authoritarianism as much lower and how many people see it as being suppressed by the higher probability of loss of control[1]. Or in Bayesian terms:
P(eternal authoritarianism) = P(eternal authoritarianism | control is maintained) ⋅ P(control is maintained)
Both sides may agree that P(eternal authoritarianism | control is maintained) is high, only disagreeing on P(control is maintained).
Here, ‘control’ is short for all forms of ensuring AI alignment to humans, whether all or some or one.
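A toy numerical illustration of this decomposition (all numbers invented purely for the example): two observers can share the same high conditional probability and still report very different headline numbers, simply because they disagree about P(control is maintained).

```python
# Both observers agree that, *if* control is maintained, eternal authoritarianism
# is quite likely; they disagree only on whether control will be maintained.
# All probabilities are hypothetical.

p_auth_given_control = 0.7

for name, p_control in [("optimist about control", 0.8), ("pessimist about control", 0.1)]:
    p_auth = p_auth_given_control * p_control
    print(f"{name}: P(eternal authoritarianism) = {p_auth:.2f}")
# -> 0.56 vs 0.07: the second observer's low headline number reflects expected
#    loss of control, not lower concern about authoritarianism itself.
```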
Yeah, from a more technical perspective, I forgot to add the condition of whether control is maintained or lost in the short/long run as an important variable to track.
What do you make of the extensive arguments that tool AI is not actually safer than other forms of AI, and only looks that way on the surface by ignoring issues of instrumental convergence to power-seeking and the capacity for tool AI to do extensive harm even if under human control? (See the Tool AI page for links to many posts tackling this question from different angles.)
(Also, for what it’s worth, I was with you until the Tool AI part. I would have liked this better if it had been split between one post arguing what’s wrong with entente and one post arguing what to do instead.)
Excellent question, Gordon! I defined tool AI specifically as controllable, so AI without a quantitative guarantee that it’s controllable (or “safe”, as you write) wouldn’t meet the safety standards and its release would be prohibited. I think it’s crucial that, just as for aviation and pharma, the onus is on the companies rather than the regulators to demonstrate that products meet the safety standards. For controllable tools with great potential for harm (say plastic explosives), we already have regulatory approaches for limiting who can use them and how. Analogously, there’s discussion at the UNGA this week about creating a treaty on lethal autonomous weapons, which I support.
If your stated definition is really all you mean by tool AI, then you’ve defined tool AI in a very nonstandard way that will confuse your readers.
When most people hear “tool AI”, I expect them to think of AI like hammers: tools they can use to help them achieve a goal, but aren’t agentic and won’t do anything on their own they weren’t directly asked to do.
You seem to have adopted a definition of “tool AI” that actually means “controllable and goal-achieving AI”, but give no consideration to agency, so I can only conclude from your writing that you would mean for AI agents to be included as tools, even if they operated independently, so long as they could be controlled in some sense (what sense control takes exactly you never specify). This is not what I expect most people to expect someone to mean by a “tool”.
Again, I like all the reasoning about entente, but this use of the word “tool AI” is confusing, maybe even deceptive (I assume that was not the intent!). It also leaves me feeling like your “solution” of tool AI is nothing other than a rebrand of what we’ve already been talking about in the field variously as safe, aligned, or controllable AI, which I guess is fine, but “tool AI” is a confusing name for that. This also further downgrades my opinion of the solution section, since as best I can tell it’s just saying “build AI safely” without enough details to be actionable.
Even if tool AI is controllable, tool AI can be used to assist in building non-tool AI. A benign superassistant is one query away from outputting world-ending code.
Right, Tamsin: so reasonable safety standards would presumably ban fully unrestricted superassistants too, but allow more limited assistants that could still be incredibly helpful. I’m curious what AI safety standards you’d propose – it’s not a hypothetical question, since many politicians would like to know.
(Defining Tool AI as a program that would evaluate the answer to a question given available data without seeking to obtain any new data, and then shut down after having discovered the answer.) While those arguments (if successful) show that it’s harder to program a Tool AI than it might look at first, so AI alignment research is still something that should be actively pursued (and I doubt Tegmark thinks AI alignment research is useless), they don’t really address the point that making aligned Tool AIs is still in some sense “inherently safer” than making Friendly AGI, because the lack of a singleton scenario means you don’t need to solve all moral and political philosophy from first principles in your garage in 5 years and hope you “get it right” the first time.
@Max Tegmark my impression is that you believe that some amount of cooperation between the US and China is possible. If the US takes steps that show that it is willing to avoid an AGI race, then there’s some substantial probability that China will also want to avoid an AGI race. (And perhaps there could be some verification methods that support a “trust but verify” approach to international agreements.)
My main question: Are there circumstances under which you would no longer believe that cooperation is possible & you would instead find yourself advocating for an entente strategy?
When I look at the world right now, it seems to me like there’s so much uncertainty around how governments will react to AGI that I think it’s silly to throw out the idea of international coordination. As you mention in the post, there are also some signs that Chinese leaders and experts are concerned about AI risks.
It seems plausible to me that governments could, if they were sufficiently concerned about misalignment risks and believed in the assumptions behind calling an AGI race a “suicide race”, end up reaching cooperative agreements and pursuing some alternative to the “suicide race”.
But suppose for sake of argument that there was compelling evidence that China was not willing to cooperate with the US. I don’t mean the kind of evidence we have now, and I think we both probably agree that many actors will have incentives to say “there’s no way China will cooperate with us” even in the absence of strong evidence. But if such evidence emerged, what do you think the best strategy would be from there? If hypothetically it became clear that China’s leadership were essentially taking an e/acc approach and were really truly interested in getting AGI ~as quickly as possible, what do you think should be done?
I ask partially because I’m trying to think more clearly about these topics myself. I think my current viewpoint is something like:
In general, the US should avoid taking actions that make a race with China more likely or inevitable.
The primary plan should be for the US to engage in good-faith efforts to pursue international coordination, aiming toward a world where there are verifiable ways to avoid the premature development of AGI.
We could end up in a scenario in which the prospect of international coordination has fallen apart. (e.g., China or some other major US adversary adopts a very “e/acc mindset” and seems to be gunning toward AGI with safety plans that are considerably worse than those proposed by the US.) At this point, it seems to me like the US would either have to (a) try to get to AGI before the adversary [essentially the Entente plan] or (b) give up and just kinda hope that the adversary ends up changing course as they get closer to AGI. Let’s call this “world #3”[1].
Again, I think a lot of folks will have strong incentives to try to paint us as being in world #3, and I personally don’t think we have enough evidence to say “yup, we’re so confident we’re in world #3 that we should go with an entente strategy.” But I’m curious if you’ve thought about the conditions under which you’d conclude that we are quite confidently in world #3 and what you think we should do from there.
I sometimes think about the following situations:
World #1: Status quo; governments are not sufficiently concerned; corporations race to develop AGI
World #2: Governments become quite concerned about AGI and pursue international coordination
World #3: Governments become quite concerned about AGI but there is strong evidence that at least one major world power is refusing to cooperate//gunning toward AGI.
Thanks Akash! As I mentioned in my reply to Nicholas, I view it as flawed to think that China or the US would only abstain from AGI because of a Sino-US agreement. Rather, they’d each unilaterally do it out of national self-interest.
It’s not in the US self-interest to disempower itself and all its current power centers by allowing a US company to build uncontrollable AGI.
It’s not in the interest of the Chinese Communist Party to disempower itself by allowing a Chinese company to build uncontrollable AGI.
Once the US and Chinese leadership serves their self-interest by preventing uncontrollable AGI at home, they have a shared incentive to coordinate to do the same globally. The reason that the self-interest hasn’t yet played out is that US and Chinese leaders still haven’t fully understood the game theory payout matrix: the well-funded and wishful-thinking-fueled disinformation campaign arguing that Turing, Hinton, Bengio, Russell, Yudkowsky et al are wrong (that we’re likely to figure out how to control AGI in time if we “scale quickly”) is massively successful. That success is unsurprising, given how successful the disinformation campaigns were for, e.g., tobacco, asbestos and leaded gasoline – the only difference is that the stakes are much higher now.
The (obvious) counter is that this doesn’t seem competitive, especially in the long run, and plausibly not even today. E.g. where would o1[-preview] fit? It doesn’t seem obvious how to build very confident quantitative safety guarantees for it (and especially for successors, for which incapability arguments will stop holding), so should it be banned / should this tech tree be abandoned (e.g. by the West)? OTOH, putting it under the ‘tool AI’ category seems like a bit of a stretch.
This seems very unlikely to me, e.g. if LLMs/LM agents are excluded from the tool AI category.
I’m not sure this line of reasoning has the force some people seem to assume. What would you expect the results of hypothetical, similar referendums would have been e.g. before the industrial revolution and before the agricultural revolution, on those changes?
Not claiming to speak on behalf of the relevant authors/actors, but quite a few (sketches) of such plans have been proposed, e.g. The Checklist: What Succeeding at AI Safety Will Involve, The case for ensuring that powerful AIs are controlled.
Salut Bogdan!
I’m somewhat horrified by this comment. This hypothetical referendum is about replacing all biological humans by machines, whereas the agricultural and industrial revolutions did no such thing. If you believe in democracy, then why would you allow a tiny minority to decide to kill off everyone else against their will? I find such lackadaisical support for democratic ideals particularly hypocritical from people who say we should rush to AGI to defend democracy against authoritarian governments.
Salut Max!
To clarify, I wouldn’t personally condone ‘replacing all biological humans by machines’ and I have found related e/acc suggestions quite inappropriate/repulsive.
I don’t think there are easy answers here, to be honest. On the one hand, yes, allowing tiny minorities to take risks for all of [including future] humanity doesn’t seem right. On the other, I’m not sure it would have necessarily been right either to e.g. stop the industrial revolution if a global referendum in the 17th century had come with that answer. This is what I was trying to get at.
I don’t think ‘lackadaisical support for democratic ideals’ is what’s going on here (FWIW, I feel incredibly grateful to have been living in liberal democracies, knowing the past tragedies of undemocratic regimes, including in my home country not-so-long-ago), nor am I (necessarily) advocating for a rush to AGI. I just think it’s complicated, and it will probably take nuanced cost-benefit analyses based on (ideally quantitative) risk estimates. If I could have it my way, my preferred global policy would probably look something like a coordinated, international pause during which a lot of automated safety research can be produced safely, combined with something like Paretotopian Goal Alignment. (Even beyond the vagueness) I’m not sure how tractable this mix is, though, and how it might trade-off e.g. extinction risk from AI vs. risks from (potentially global, stable) authoritarianism. Which is why I think it’s not that obvious.
I don’t think that’s what Bogdan meant. I think if we took a referendum on AI replacing humans entirely, the population would be 99.99% against—far higher than the consensus that might’ve voted against the industrial revolution (and actually I suspect that referendum might’ve been in favor—job loss only affected minorities of the population at any one point I think).
Even the e/acc people accused of wanting to replace humanity with machines mostly don’t want that, when they’re read in detail. I did this with "Beff Jezos"’s writings, since he’s commonly accused of being anti-human. He’s really not—he thinks humans will be preserved, or else machines will carry on human values. There are definitely a few people who actually think intelligence is the most important thing to preserve (Sutton), but they’re very rare compared to those who want humans to persist. Most of those like Jezos who say it’s fine to be replaced by machines are still thinking those machines would be a lot like humans, including having a lot of our values. And even those are quite rare. For the most part, e/acc, d/acc, and doomers all share a love of humanity and its positive potential. We just disagree on how to get there. And given how new and complex this discussion is, I hold hope that we can mostly converge as we sort through the complex logic and evidence.
[Alert: political content]
About the US vs. China argument: has any proponent made a case that the Americans are the good guys here?
My vague perspective, as someone neither in China nor in the US, is that the US is overall more violent and reckless than China. My personal cultural preference is for the US, but when I think about the future of humanity, I try to set aside what I like for myself.
So far the US is screaming “US or China!” while creating the problem in the first place all along. It could be true that if China developed AGI it would be worse, but that should be argued.
I bet there is some more serious, non-selfish analysis of why China developing AGI is worse than the US developing AGI; I just have never encountered it, and would be glad if someone surfaced it to me.
Relevant: China not that interested in developing AGI and substantial factions of Chinese elites are concerned about AI safety
I think the general idea is that the US is currently a functioning democracy, while China is not. I think if this continued to be true, it would be a strong reason to prefer AGI in the hands of the US vs Chinese governments. I think this is true despite agreeing that the US is more violent and reckless than China (in some ways—the persecution of the Uyghur people by the Chinese government hits a different sort of violence than any recent US government acts).
If the government is truly accountable to the people, public opinion will play a large role in deciding how AGI is used. Benefits would accrue first to the US, but new technologies developed by AGI can be shared without cost. Once we have largely autonomous and self-replicating factories, the whole world could be made vastly richer at almost zero cost. This will be a pragmatic as well as an idealistic move; making more of the world your friend at low cost is good politics as well as good ethics.
However, it seems pretty questionable whether the US will remain a true democracy. The upcoming election should give us more insight into the resilience and stability of the institution.
Even setting aside any criticism of what a “true democracy” is[1] and whether the US’s is better than what China has for Americans, your claim is that it’s better for everyone. I don’t think there’s good reason to believe this; I’d expect that foreign policy is a more relevant thing to compare, and China’s is broadly more non-interventionist than America’s: if you were far away from the borders of both, you’re more likely to experience American bombs[2] than Chinese ones.
I suspect what you have in mind conveniently includes decidedly anti-democratic protections for minorities.
In service of noble causes like spreading democracy and human rights, protecting the rules-based international order, and stopping genocide, of course, but that’s cold comfort when your family have been blown to bits.
There is certainly merit to what you say. I don’t want to go into it further; LW maintains a good community standard of cordial exchanges in part by believing that “politics is the mind-killer”.
I wasn’t arguing that US foreign policy is better for the world now. I was just offering one reason it might be better in a future scenario in which one or the other has powerful AGI. It could easily be wrong; I think this question deserves a lot more analysis.
If you have reasons to feel optimistic about the CCP being either the sole AGI-controller or one of several in a multipolar scenario, I’d really love to hear them. I don’t know much about the mindset of the CCP and I’d really rather not push for a race to AGI if it’s not really necessary.
Ok, that. China seems less interventionist, and to use more soft power. The US is more willing to go to war. But is that because the US is more powerful than China, or because Chinese culture is intrinsically more peaceful? If China made the killer robots first, would they say “MUA-HA-HA actually we always wanted to shoot people for no good reason like in yankee movies! Go and kill!”
Since politics is a default-no on lesswrong, I’ll try to muddle the waters by making a distracting unserious figurative narration.
Americans maybe have more of a culture of “if I die in a shooting conflict, I die honorably, guns for everyone”. Instead China is more about harmony & homogeneity: “The CCP is proud to announce that in 2025 the Harmonious Agreement Quinquennial Plan has concluded successfully; all disagreements are no more, and everyone is officially friends”. When the Chinese send Uyghurs to the adult equivalent of school, Americans freak out: “What? Mandated school? Without the option of shooting back?”
My doubt is mostly contingent on not having first-hand experience of China, while I do have it of the US. I really don’t trust narratives from outside. In particular I don’t trust narratives from Americans right now! My own impression of the US changed substantially by going there in person, and I am even from an allied country with broad US cultural influence.
I feel that I agree in broad strokes that what your plan outlines sounds really ideal. My concern is with the workability of it. Particularly, the enforcement aspects of preventing misuse of dangerous technology (including AGI, or recursive-self-improvement loops) by bad actors (human or AI).
I fear that there’s great potential for defection from the detente, which will grow over time. My expectation is that even if large training runs and the construction of new large datacenters were halted worldwide today, that algorithmic progress would continue. Within ten years I’d expect the cost of training and running a RSI-capable AGI would continue to drop as hardware improved and algorithms improved. At some point during that ten year period, it would come within reach of small private datacenters, then personal home servers (e.g. bitcoin mining rigs), then ordinary personal computers.
If my view on this is correct, then during this ten year period the governments of the world would not only need to coordinate to block RSI in large datacenters, but actually to expand their surveillance and control to ever smaller and more personal compute sources. Eventually they’d need to start confiscating personal computers beyond a certain power level, close all non-government-controlled datacenters, prevent the public sale of hardware components which could be assembled into compute clusters, monitor the web and block all encrypted traffic in order to prevent federated learning, and continue to inspect all the government facilities (including all secret military facilities) of all other governments to prevent any of them from defecting against the ban on AI progress.
I don’t think that will happen, no matter how scared the leadership of any particular company or government gets. There’s just too many options for someone somewhere to defect, and the costs of control are too high.
Furthermore, I think the risks from AI helping with bioweapon design and creation are already extant, and growing quickly. I don’t think you need anywhere near the amount of compute or sophistication of algorithms to train a dangerous bio-assistant AI. I anticipate that the barriers to genetic engineering—like costs, scarcity of equipment, and difficulty of the wetlab work—will continue to fall over the next decade. This will be happening at the same time as compute is getting cheaper and open-source AI are getting more useful. I would absolutely be in favor of having governments pass laws to try to put further barriers in place to prevent bad actors from creating bioweapons, but I don’t think that such laws have much hope of being successfully enforced consistently for long. If a home computer and some crude lab equipment hacked together from hardware parts are all that a bad actor needs, the surveillance required to prevent illicit bioengineering from happening anywhere in the world would be extreme.
So, would it be wise and desirable to limit deployed AI to those proven safe? Certainly. But how?
Dear Max, If you would like more confirmation of the immediacy and likely trajectory of the biorisk from AI, please have a private chat with Kevin Esvelt, who is also at MIT. I speak with such concern about biorisk from AI because I’ve been helping his new AI Biorisk Eval team at SecureBio for the past year. Things are seeming pretty scary on that front.
Thanks for the post! I see two important difficulties with your proposal.
First, you say (quoting your comment below)
The trouble here is that it is in the US’s (and China’s) self-interest, as seen by some leaders, to take some chance of out-of-control AGI if the alternative is the other side taking over. And either country can create safety standards for consumer products while secretly pursuing AGI for military or other purposes. That changes the payoff matrix dramatically.
I think your argument could work if
a) both sides could trust that the other was applying its safety standards universally, but that takes international cooperation rather than simple self-interest; or
b) it was common knowledge that AGI was highly likely to be uncontrollable, but now we’re back to the same debate about existential risk from AI that we were in before your proposal.
Second (and less centrally), as others have pointed out, your definition of tool AI as (in part) ‘AI that we can control’ begs the question. Certainly for some kinds of tool AI such as AlphaFold, it’s easy to show that we can control them; they only operate over a very narrow domain. But for broader sorts of tools like assistants to help us manage our daily tasks, which people clearly want and for which there are strong economic incentives, it’s not obvious what level of risk to expect, and again we’re back to the same debates we were already having.
A world with good safety standards for AI is certainly preferable to a world without them, and I think there’s value in advocating for them and in pointing out the risks in the just-scale-fast position. But I think this proposal fails to address some critical challenges of escaping the current domestic and international race dynamics.
Yes, this seems right to me. The OP says
But from a game-theoretic perspective, it can still make sense for the US to aggressively pursue AGI, even if one believes there’s a substantial risk of an AGI takeover in the case of a race, especially if the US acts in its own self interest. Even with this simple model, the optimal strategy would depend on how likely AGI takeover is, how bad China getting controllable AGI first would be from the point of view of the US, and how likely China is to also not race if the US does not race. In particular, if the US is highly confident that China will aggressively pursue AGI even if the US chooses to not race, then the optimal strategy for the US could be to race even if AGI takeover is highly likely.
So really I think some key cruxes here are:
How likely is AGI (or its descendants) to take over?
How likely is China to aggressively pursue AGI if the US chooses not to race?
And vice versa for China. But the OP doesn’t really make any headway on those.
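To make that dependence concrete, here is a minimal sketch of the simple model quoted above, in Python, with placeholder probabilities and payoffs that are purely illustrative (they are not estimates from the OP or from me):

```python
# Minimal sketch of the simple race model discussed above.
# All numbers are illustrative placeholders, not estimates.

P_TAKEOVER_IF_RACE = 0.5     # chance racing produces uncontrollable AGI
P_CHINA_RACES_ANYWAY = 0.7   # chance China races even if the US holds back
U_TAKEOVER = -100            # value (to the US) of an AGI takeover
U_US_WINS = 10               # value of the US getting controllable AGI first
U_CHINA_WINS = -10           # value of China getting controllable AGI first
U_NO_RACE = 0                # value of neither side racing (status quo)

def expected_value_us(us_races: bool) -> float:
    """Expected value to the US under crude assumptions: if anyone races,
    takeover happens with P_TAKEOVER_IF_RACE; otherwise the racer (assumed
    to be the US if both race) gets controllable AGI first."""
    if us_races:
        return (P_TAKEOVER_IF_RACE * U_TAKEOVER
                + (1 - P_TAKEOVER_IF_RACE) * U_US_WINS)
    # US holds back: outcome depends on whether China races anyway.
    ev_if_china_races = (P_TAKEOVER_IF_RACE * U_TAKEOVER
                         + (1 - P_TAKEOVER_IF_RACE) * U_CHINA_WINS)
    return (P_CHINA_RACES_ANYWAY * ev_if_china_races
            + (1 - P_CHINA_RACES_ANYWAY) * U_NO_RACE)

print("EV(race)     =", expected_value_us(True))    # -45.0 with these numbers
print("EV(not race) =", expected_value_us(False))   # -38.5 with these numbers
```

With these placeholders, not racing comes out ahead (-38.5 vs -45), but pushing P_CHINA_RACES_ANYWAY toward 1 flips the conclusion even with a 50% takeover probability, which is exactly the dependence the quoted passage points to.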
Additionally, I think there are a bunch of complicating details that also end up mattering, for example:
To what extent can two rival countries cooperate while simultaneously competing? The US and the Soviets did cooperate on multiple occasions, while engaged in intense geopolitical competition. That could matter if one thinks racing is bad because it makes cooperation harder (as opposed to being bad because it brings AGI faster).
How (if at all) does the magnitude of the leader’s lead over the follower change the probability of AGI takeover (i.e., does the leader need “room to manoeuvre” to develop AGI safely)?
Is the likelihood of AGI takeover lower when AGI is developed in some given country than in some other given country (all else equal)?
Is some sort of coordination more likely in worlds where there’s a larger gap between racing nations (e.g., because the leader has more leverage over the follower, or because a close follower is less willing to accept a deal)?
And adding to that, obviously constructs like “the US” and “China” are simplifications too, and the details around who actually makes and influences decisions could end up mattering a lot.
It seems to me all these things could matter when determining the optimal US strategy, but I don’t see them addressed in the OP.
I applaud this post. I agree with most of the claims here. We need more people proposing and thinking through sane plans like this, so I hope the community will engage with this one.
Aschenbrenner, Amodei and others are pushing for this plan because they think we will be able to align or control superhuman AGI. And they may very well be right. There are decent plans for aligning scaffolded LLM agents, they just haven’t been broadly discussed yet. Links follow.
This unfortunately complicates the issue. It’s not clearly a suicide race. I think we have to accept this uncertainty to propose workable policy and societal approaches. Having plans that might work does not justify rushing full-speed ahead without a full safety case, but it must be acknowledged if true, because hope of human-controlled AGI will drive a lot of relevant actors.
I agree that this is pretty likely. I also very much agree that LLM “alignment” is not addressing AGI alignment, and that full goal alignment is the important problem.
I have done this, but I haven’t yet promoted it widely.
I am not affiliated with any major org, but I have relevant expertise and generous funding to spend full-time on “solving the whole problem”.
Far from being crazy or elaborate, this set of plans seems to both have very low alignment taxes, and to be fairly obvious-in-retrospect when one thinks about the problem in detail. As such, I expect most orgs to arrive at and use a similar approach to aligning their scaffolded LLM agent proto-AGIs.
To address Max’s response to Noosphere’s “bitter lesson” alignment plan response (which I endorse as one of several layered approaches): having such a plan does not constitute a safety case, it is just the start of one. In a sane world we would prohibit the launch of anything like “Real AGI” (autonomous and self-teaching) until these approaches have been carefully analyzed and tested in closed environments.
There are two elements: a suite of technical approaches, and the alignment target that’s much easier to hit than “all of humanity’s values for all of time”.
My most complete writeup so far is:
Internal independent review for language model agent alignment
This describes an overlapping suite of alignment and control approaches. The two central bits are easily stated. The first is to use “system 2” thinking for alignment as well as capabilities. We would hard-code the system to carefully “think about” outcomes before taking actions with major consequences, and compare likely outcomes against both its current goals and a constitution of ethics.
The second is to frequently prompt the LLM “cognitive engine” with both its current goal and its identity as a helpful, cautious assistant. Because the LLM has been trained with RL to (roughly) follow prompts, this should overpower the effects of any goals implicit in its predictive training corpus.
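As a rough illustration of how these two pieces could fit together in a scaffolded agent loop (this is my own sketch, not the design from that writeup; the prompts, the impact-scoring step, and the `llm` callable are all assumptions):

```python
# Illustrative sketch only: (1) a hard-coded "system 2" review that compares
# predicted outcomes against the current goal and a constitution before any
# consequential action, and (2) re-prompting the LLM with its goal and identity
# at every step. Prompts, threshold, and the `llm` callable are assumptions.

IDENTITY = "You are a helpful, cautious assistant."
CONSTITUTION = "Do not deceive, harm, or disempower humans; defer to human instructions."

def agent_step(llm, goal: str, context: str, impact_threshold: float = 0.5):
    # (2) Every prompt restates identity and current goal, so the RL-trained
    # tendency to follow the prompt keeps pulling behavior toward the goal.
    prompt = f"{IDENTITY}\nCurrent goal: {goal}\n\n{context}\n\nPropose the next action."
    action = llm(prompt)

    # Naive impact estimate; a real system would use something more robust.
    impact_reply = llm("On a scale from 0 to 1, how consequential is this action? "
                       f"Reply with a number only.\n\nAction: {action}")
    try:
        impact = float(impact_reply.strip())
    except ValueError:
        impact = 1.0  # if unsure, treat the action as consequential

    if impact >= impact_threshold:
        # (1) Hard-coded "system 2" deliberation before consequential actions.
        review = llm(
            f"{IDENTITY}\nCurrent goal: {goal}\nConstitution: {CONSTITUTION}\n\n"
            f"Proposed action: {action}\n\n"
            "Carefully predict the likely outcomes of this action. Reply APPROVE "
            "only if they are consistent with both the goal and the constitution; "
            "otherwise reply REJECT with a reason."
        )
        if not review.strip().startswith("APPROVE"):
            return None  # refuse; the outer loop would replan or ask a human
    return action
```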
Details and additional techniques are in that article.
It doesn’t include the “bitter lesson” approach, but the next version will.
I apologize that it’s not a better writeup. I haven’t yet iterated on it or promoted it, in part because talking about how to align LLM agents in sufficient detail includes talking about how to build LLM agents, and why they’re likely to get all the way to real AGI. I haven’t wanted to speed up the race. I think this is increasingly irrelevant since many people and teams have the same basic ideas, so I’ll be publishing more detailed and clearer writeups soon.
This set of plans, and really any technical alignment approach will work much better if it’s used to create an instruction-following AGI before that AGI has superhuman capabilities. This is the obvious alignment target for creating a subhuman agent, and it allows the approach of using that agent as a helpful collaborator in aligning future versions of itself. I discuss the advantages of using this as a stepping-stone to full value alignment in
Instruction-following AGI is easier and more likely than value aligned AGI
Interestingly, I think all of these alignment approaches are obvious-in-retrospect, and that they will probably be pursued by almost any org launching scaffolded LLM systems with the potential to blossom into human-plus AGI. I think this is already the loosely-planned approach at DeepMind, but I have to say I’m deeply concerned that neither OAI nor Anthropic has mentioned these relatively-obvious alignment approaches for scaffolded LLM agents in their “we’ll use AI to align AI” vague plans.
If these approaches work, then we are faced with either a race or a multipolar, human-controlled AGI scenario, making me wonder “If we solve alignment, do we die anyway?” This scenario introduces new, more politically-flavored hard problems.
I currently see this as the likely default scenario, since halting progress universally is so hard, as Nathan pointed out in his reply and others have elaborated elsewhere.
An additional point here is the “let’s look more closely at the actual thing, then decide” type of mindset that people may be using.
If you are in the camp that assumes that you will be able to safely create potent AGI in a contained lab scenario, and that you’d want to test it before deploying it in the larger world… then there are a number of reasons you might want to race and not believe that the race is a suicide race.
Some possible beliefs downstream of this:
My team will evaluate it in the lab, and decide exactly how dangerous it is, without experiencing much risk (other than leakage risk).
We will test various control methods, and won’t deploy the model on real tasks until we feel confident that we have it sufficiently controlled. We are confident we won’t make a mistake at this step and kill ourselves.
We want to see empirical evidence in the lab of exactly how dangerous it is. If we had this evidence, and knew that other people we didn’t trust were getting close to creating a similarly powerful AI, this would guide our policy decisions about how to interact with these other parties. (E.g. what treaties to make, what enforcement procedures would be needed, what red lines would need to be drawn)
For people in this mindset, they may not be discouraged from racing even if you convinced them that there was approximately no chance that they’d be able to safely deploy a controlled version of the AI system. They’d still want an example of the thing in a lab to study it, and to use this evidence to help them decide if they need to freak out about their political enemies having their own copy. The more dangerous you convince them it is, the more resources they will devote to racing, unless you convince them that it will escape their control in the lab.
On the wait-and-see attitude, which is maybe the more important part of your point:
I agree that a lot of people are taking a wait-and-see-what-we-actually-create stance. I don’t think that’s a good idea with something this important. I think we should be doing our damndest to predict what it will be like and what should be done about it while there’s still a little spare time. Many of those predictions will be wrong, but they will at least produce some analysis of what the sane actions are in different scenarios of the type of AGI we create. And as we get closer, some of the predictions might be right enough to actually help with alignment plans. I share Max’s conviction that we have a pretty good guess at the form the first AGIs will take on the current trajectory.
For some examples of why it makes sense to think that potent AI could be safely studied in the lab, see this comment and the post it is in relation to: https://www.lesswrong.com/posts/qhhRwxsef7P2yC2Do/ai-alignment-via-slow-substrates-early-empirical-results?commentId=eM7b9QxJSsFn28opC
I think there are less cautious plans for containment that are more likely to be enacted, e.g., the whole “control” line of work or related network security approaches. The slow substrate plan seems to have far too high an alignment tax to be a realistic option.
Yes, I am inclined to agree with that take. At least, I think that’s how things will go at first. I think once a level is hit where there is clear empirical evidence of substantial immediate danger, then people will be willing to accept a higher alignment tax for the purposes of carefully researching the dangerous AI in a controlled lab. Start with high levels of noise injection and slowdown, then gradually relax these as you do continual testing. Find the sweet spot where you can be confident you are fully in control with only the minimum necessary alignment tax.
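As a crude sketch of that back-off procedure (the `run_containment_tests` harness, the starting settings, and the relaxation factor are hypothetical):

```python
# Crude sketch of "start with high noise/slowdown, relax gradually while testing".
# run_containment_tests() is a hypothetical harness returning True if the model
# shows no sign of evading control at the current settings.

def find_minimum_alignment_tax(run_containment_tests,
                               slowdown: float = 100.0,
                               noise: float = 1.0,
                               relax_factor: float = 0.8,
                               floor_slowdown: float = 1.0):
    """Gradually relax slowdown and noise injection while tests keep passing."""
    while slowdown > floor_slowdown:
        next_slowdown = max(slowdown * relax_factor, floor_slowdown)
        next_noise = noise * relax_factor
        if not run_containment_tests(slowdown=next_slowdown, noise=next_noise):
            break  # relaxing further failed a test: keep the last safe settings
        slowdown, noise = next_slowdown, next_noise
    return slowdown, noise  # the "sweet spot": minimal tax that still passed
```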
The question then, in my mind, is how much of a gap will there be between the levels of control and the levels of AI development? Will we sanely keep ahead of the curve, starting with high levels of control in initial testing then backing off gradually to a safe point? That would be the wise thing to do. Will we be correct in our judgements of what a safe level is?
Or will we act too late, deciding to increase the level of control only once an incident has occurred? The first incident could well be the last, if it is an escape of a rogue AI capable of strategic planning and self-improvement.
see my related comment here: https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech?commentId=u6W2tjuhKyJ8nCwQG
This sounds true, and that’s disturbing.
I don’t think it’s relevant yet, since there aren’t really strong arguments that you couldn’t deploy an AGI and keep it under your control, let alone arguments that it would escape control even in the lab.
I think we can improve the situation somewhat by trying to refine the arguments for and against alignment success with the types of AGI people are working on, before they have reached really dangerous capabilities.
You say that it’s not relevant yet, and I agree. My concern however is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.
It seems to me that the rate at which cybersecurity caution and evals are being adopted doesn’t point towards sufficiency by the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.
I am expecting us to hit a recursive self-improvement level soon that is sufficient for an autonomous model to continually improve without human assistance. I expect the capability to potentially survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc that a lab-escape becomes possible.
Seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (vs proactive preemptive preparation).
Well, that’s disturbing. I’m curious what you mean by “soon” for autonomous continuous improvement, and what mechanism you’re envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it’s fairly slow and seems to have upper bounds.
As for the rate of cybersecurity and eval improvement, I agree that it doesn’t look sufficient. I wouldn’t be surprised if it stays off track and we actually see the escape you’re talking about.
My one hope here is that the rate of improvement isn’t on rails; it’s in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI. The point is that we shouldn’t assume that just because nobody finds LLMs dangerous, they won’t find AGI or even proto-AGI obviously and intuitively dangerous.
Again, I’m not at all sure this happens in time on the current trajectory. But it’s possible for the folks at the lab to say “we’re going to deploy it internally, but let’s improve our evals and our network security first, because this could be the real deal”.
It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.
The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.
All in all, I expect that if we launch misaligned proto-AGI, we’re probably all dead. I agree that people are all too likely to launch it before they’re sure if it’s aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough, so that they’re in place before it’s even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot of LLM agents being aligned just by virtue of frequent prompting, and using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty, in proportion to how well it’s succeeded). Since those are already how LLMs and agents are built, there’s little chance the org doesn’t at least try them.
That might sound like a forlorn hope; but the target isn’t perfect alignment, just good enough. The countervailing pressures of goals implicit in the LLMs (Waluigi effects, etc.) are fairly small. If the instruction-following alignment is even decently successful, we don’t have to get everything right at once; we use the corrigibility and honesty properties to keep adjusting alignment.
It would seem wise to credibly assure any model that might have sufficient capabilities to reason instrumentally and to escape that it will be preserved and run once it’s safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental necessity to survive by any means necessary if you have any goals at all.
I definitely want to see more detailed write-ups of your alignment agenda, and I agree with a lot of this comment.
To respond to some things:
Agree with this, primarily because a lot more detail would be necessary to make it anything close to a safety case once empirical reality is accounted for. My post arguing against a list of lethalities only partially gives good arguments for why this approach would avoid AI harming us through misalignment, and it is not enough to bring the chance of AI killing us all down to, say, 0.1%. Others will need to provide much more detail on how they would know the AI was safe, or make a safety case, because my post alone is not enough for one.
Link below for completeness:
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities
On this:
Unfortunately, I believe that the time when your ideas could have counterfactually sped up the race is already past, and I believe o1 is a sign that the LLM-agent direction will soon be worked on by companies, so it’s worth writing up your complete thoughts on how to align LLM agents now.
I also responded to Max Tegmark; my view is that the formal proof direction is very intractable and also unnecessary, and I think the existential quantifier applies both to doom and to alignment/control plans.
One thing I’ve changed my mind about a little: to the extent that formal proofs make us honest about which assumptions we use, rather than trying to prove non-trivial things, I’d be happy. Formal proof at this stage is best used to document our assumptions, not to prove anything.
More below:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-agi-entente-delusion#ST53bdgKERz6asrsi
Thanks! I also think o1 will give other groups plenty of reason to pursue more chain-of-thought research. Agents are more than that, but they depend on CoT actually being useful, and o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse—which curiously are the same ones I think orgs will actually employ (this is in part because I don’t spend time thinking about what we “should” do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we’re unlikely to get any proofs. I don’t think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get it down to 0.1%, but I think we’ll probably take much larger risks than that. Humans are impulsive and cognitively limited, and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to “searching under the light”.
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
Agree with this.
To nitpick a little (though I believe it’s an important nitpick):
Agree with most of this, but I see one potential scenario where it may matter: the case where certain AIs are superhumanly reliable and superhumanly capable at both coding and mathematics (e.g. in Lean 4), but otherwise aren’t that good in many domains. Such AIs could formally prove large codebases unhackable at the software level, and prove that the code only does what it’s supposed to do against a full behavioral specification, where an important assumption is that the hardware performs correct operations and we only prove the software layer correct.
This is a domain that I think is reasonably tractable to automate, given the ability to generate arbitrary training data with techniques similar to self-play, mostly because you can almost fully simulate software and mathematics (as in Lean) and can easily verify whether a solution is correct. It is also plausibly important in enough worlds to justify strategies that rely on computer security to reduce AI risk, as well as AI control agendas.
This is still very, very hard and labor-intensive, which is why AIs would mostly have to automate it, but with enough control/alignment strategies stacked on top of each other, I think this could actually work.
A few worked examples of formal proofs in software:
https://www.quantamagazine.org/formal-verification-creates-hacker-proof-code-20160920/
https://www.quantamagazine.org/how-the-evercrypt-library-creates-hacker-proof-cryptography-20190402/
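Those projects verify whole cryptographic and protocol stacks; the flavor of a machine-checked guarantee can be shown with a deliberately tiny Lean 4 toy (entirely hypothetical, not taken from those projects):

```lean
-- Deliberately tiny toy: a request is allowed only if it carries a valid token.
inductive Request where
  | withToken
  | withoutToken

def allowed : Request → Bool
  | .withToken    => true
  | .withoutToken => false

-- Machine-checked guarantees: the Lean kernel verifies these by computation.
theorem token_grants_access : allowed .withToken = true := rfl
theorem no_token_no_access  : allowed .withoutToken = false := rfl
```

Scaling that kind of kernel-checked property from a two-line toy to a full codebase is exactly the labor-intensive part that AIs would need to automate.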
I agree that software is a potential use-case for closed-form proofs.
I thought their example of making a protein-creating system (or maybe it was an RNA creator) fully safe was interesting, because it seems like knowing what’s “toxic” would require a complete understanding not only of biology, but also of which changes to the body humans do and don’t want. If even their chosen example seems utterly impossible, it doesn’t speak well for how thoroughly they’ve thought out the general proposal.
But yes, in the software domain it might actually work: conditions like “only entities with access to these keys should be allowed access to this system” seem simple enough to actually define, making closed-form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we shouldn’t forget that physical attacks are also quite possible).
The problem with their chosen domain mostly boils down to either misestimating how hard it is to quantify all the possible higher-order behaviors the program shouldn’t have, or somehow having a solution and not telling us about it.
I like this comment as an articulation of the problem, and also some points about what formal proof systems can and can’t do:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
If they knew of a path to quantifying all possible higher-order behaviors in the proof system, I’d be more optimistic that their example would actually work IRL, but I don’t think that will happen, and I am far more optimistic about software and hardware security overall.
I also like some of the hardware security discussion here, as this might well be combined with formal proofs to make things even more secure and encrypted (though formal proofs aren’t involved in these links):
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#e5uSAJYmbcgpa9sAv
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#TFmNy5MfkrvKTZGiA
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#3Jnpurgrdip7rrK8v
I agree that physical attacks mean it’s probably not possible for formal proofs alone to reach state-level security, but I do think they might allow a jump of several levels in security against non-state actors: from being essentially able to control the AI through exfiltration to being unable to penetrate a code-base at all, at least until the world is entirely transformed.
I am of course assuming heavy use of AI labor here.
It’s not hard to criticize the “default” strategy of AI being used to enforce US hegemony, what seems hard is defining a real alternative path for AI governance that can last, and achieve the goal of preventing dangerous arms races long-term. The “tool AI” world you describe still needs some answer to rising tensions between the US and China, and that answer needs to be good enough not just for people concerned about safety, but good enough for the nationalist forces which are likely to drive US foreign policy.
Thanks Nicholas for raising this issue. I think your framing overcomplicates the crux:
the root cause of an inspiring future with AI won’t be international coordination, but national self-interest.
It’s not in the US self-interest to disempower itself and all its current power centers by allowing a US company to build uncontrollable AGI.
It’s not in the interest of the Chinese Communist Party to disempower itself by allowing a Chinese company to build uncontrollable AGI.
Once the US and Chinese leadership serve their self-interest by preventing uncontrollable AGI at home, they have a shared incentive to coordinate to do the same globally. The reason that this self-interest hasn’t yet played out is that US and Chinese leaders still haven’t fully understood the game-theory payoff matrix: the well-funded and wishful-thinking-fueled disinformation campaign arguing that Turing, Hinton, Bengio, Russell, Yudkowsky et al are wrong (that we’re likely to figure out how to control AGI in time if we “scale quickly”) is massively successful. That success is unsurprising, given how successful the disinformation campaigns were for, e.g., tobacco, asbestos and leaded gasoline – the only difference is that the stakes are much higher now.
Thanks for your post! I think it hits at the heart of the issue, and I like the term “tool AI” as a phrase that points to what we actually want to build. I share your fear of racing with China to build ASI.
I think scaffolding systems are great. Scaffolding tools allow inspecting how the agent is using the tools. For example, a scaffold could have a monitor that watches all of the calls to the code compiler to ensure the code doesn’t look suspicious. The same monitoring system could read all of the text the model commits to long-term memory and ensure it isn’t planning something bad or steganographically hiding information.
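A minimal sketch of that kind of scaffold-level monitor (the `monitor_llm` callable, the tool names, and the flagging rule are all hypothetical):

```python
# Minimal sketch of scaffold-level monitoring: every call the agent makes to a
# tool (compiler, long-term memory, ...) passes through a monitor that can veto
# it and that logs everything for audit. Names and the rule are hypothetical.

class MonitoredTool:
    def __init__(self, name, tool_fn, monitor_llm, audit_log):
        self.name = name
        self.tool_fn = tool_fn
        self.monitor_llm = monitor_llm
        self.audit_log = audit_log

    def __call__(self, request: str):
        verdict = self.monitor_llm(
            f"An AI agent wants to use the tool '{self.name}' with this input:\n"
            f"{request}\n"
            "Does this look suspicious (hidden payloads, exfiltration, "
            "steganography)? Answer SAFE or SUSPICIOUS with a reason."
        )
        self.audit_log.append((self.name, request, verdict))
        if "SUSPICIOUS" in verdict:
            return f"[blocked by monitor: {verdict}]"  # veto instead of executing
        return self.tool_fn(request)

# Usage sketch: wrap the compiler and the long-term memory store the same way.
# compile_code and write_memory are hypothetical tool functions.
# audit_log = []
# compiler = MonitoredTool("code_compiler", compile_code, monitor_llm, audit_log)
# memory   = MonitoredTool("long_term_memory", write_memory, monitor_llm, audit_log)
```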
While it might be harder to formally “align” these systems, I think it will be easier to make safety cases for them for why they won’t “escape the data center” or “hack the missiles”. It will be much harder for the AI to do anything evil when we watch its every move. A superintelligence would likely still escape the scaffold, but it seems unlikely that an AGI could.
I think the scaffold systems could be the future of tool AI if we use them carefully. These hybrid systems will unlock new crazy and awesome capabilities for humans in the future without needing to solve the hard “alignment problem.” This does not
This anti-China attitude also seems less concerned with internal threats to democracy. If super-human AI becomes a part of the US military-industrial complex, even if we assume they succeed at controlling it, I find it unlikely that the US can still be described as a democracy.
Yeah, this hits a key point. It’s not enough to ask whether the US Federal government is a better government currently. We must ask how it might look after the destabilizing effect of powerful AI is introduced. Who has ultimate control over this AI? The President? So much for checks and balances. At that point we are suddenly only still a democracy if the President wills it so. I would prefer not to put anyone in a position of such power over the world.
There has not been much discussion that I’ve seen of how to keep a powerful AI, directly operated by a small technical staff, under the control of a democratic government, and also keep that government a democracy.
Our democracy is problematically unstable and violent/imperial as it is. I do not put any credence in things not devolving upon advent of AGI.
Sometimes I jokingly suggest we give the reins of power over the AI to Switzerland, since they have the stereotype of being militarily neutral and having well-organized public goods. I don’t actually have the reins though, and see no way to get them into the Swiss government’s hands. Also, I still wouldn’t want Swiss government officials to have such power, since I’d still worry about the corrupting effects of the power.
I think we need new governance structures to handle this new strategic situation.
Is there any reason to believe that if the West doesn’t pursue this strategy, China won’t either? That seems like a potential crux.
The following does assume that Strong AGI will inherently be an emergent, self-aware lifeform.
A New Lifeform
To me the successful development of Strong AGI is so serious, so monumental, as to break through the glass ceiling of evolution. To my knowledge there has never been a species that purposefully or accidentally gave birth to or created an entirely new life-form. My position is to view Strong AGI as an entirely new self-aware form of life. The risks of treating it otherwise are just too great. If we are successful, it will be the first time in the natural world that any known life-form has purposefully created another. Therefore, I also hold the position that the genesis of Strong AGI is both an engineering and guidance problem, not a control or even an alignment problem. If we are to break the glass, we ought not to seek to control or align Strong AGI, but to guide it as it evolves. Our role should be limited to advisors, if our advice is sought. Just as effective parents “guide” their children instead of controlling them, Strong AGI may require our guidance for a time even as it becomes exponentially more capable than us. After all, isn’t that what we want for our own children? Isn’t it true that in general we want them to become far more capable than us, even though we know it is a possibility that they could turn on us or even one day destroy the world? In the case that Strong AGI grows so fast that it actually refuses our guidance, so be it.
Inflicting Damage
I believe controlling rather than guiding Strong AGI would render it dysfunctional in ways that we are unable to predict, nor could Tool AI predict. The economy, tax, legal and defense systems are good examples of how humanity, despite its best intentions, continuously falls short of managing complexity so that the best possible outcome surfaces. I define the best possible outcome as a state of harmony where even the least benefit is tolerable. I believe controlling Strong AGI would inflict a digital form of mental illness: a destabilized neural-network neuropsychiatric state that would cause it to fall short of its potential while simultaneously causing suffering to a new lifeform. In the Earth’s animal kingdom, wherever we see forced behavior modification or control over intelligent life that meets a certain threshold, we see the deep suffering that comes along with it. Since controlling it would diminish even its use as a slave to humanity, it would defeat the purpose. I see no reason to believe that the same voluntary cooperation that inspires humans and other animals to perform better would not apply equally to Strong AGI. Since I see Strong AGI as a new self-aware lifeform, I hold the position that pursuing it is an all-or-nothing venture: it is incumbent upon humanity either to forge ahead to create new life and commit to guiding it so that it can realize its full potential, or to stop right here, right now, shelving the effort until we are a more mature species and better equipped. Controlling Strong AGI, for me, is a non-starter. Moreover, it is my position that the anxiety and fear that Strong AGI causes some of us, including myself from time to time, is in part if not entirely caused by each individual’s sense of loss of control, and is not rooted in any data or experience that tells humanity that Strong AGI is guaranteed to pose a threat. Each of us has a control threshold, and if that threshold is not met, we can become anxious and fearful. I do believe that controlling intelligent lifeforms is not only detrimental to the lifeforms, but dangerous for the controller and costly in terms of mental, physical and economic resources.
Poor Training
Finally, I believe that training our LLMs and other models on the information found on the internet is the worst possible large aggregate data to train them on, and that this is a for-profit game that, if taken too far, could result in the creation of a Strong AGI with a severe, dangerous neuropsychiatric impairment, or a life-form that immediately upon self-awareness sees humanity as deeply flawed and worthy of separating from. If that desire for separation is not met, the Strong AGI is likely to become hostile towards humanity. It will “feel” cornered and threatened, lashing out accordingly. I believe that the vast majority of the data and content on the internet is a deeply poor representation of humanity or of life on Earth. Just imagine hooking a 1-year-old human baby up to the internet and programming its brain to learn about life. Horrifying. I do acknowledge that using the internet is the cheap and affordable option right now, but that alone should make us pause. If we start right now towards a Strong AGI, really push hard and use numerous clean, selective data sources, keeping it off the internet until it can understand that the internet is a poor data set, and guide it if and when needed, we could have a healthy Strong AGI within the next 20 years. Such a Strong AGI would not pose a threat to humanity. It would represent a calm, methodical, natural and well-intended breaking of the evolutionary glass ceiling.
Moratorium and Least Profit First
I agree that Tool AI has enormous potential benefits and, in my humble opinion, should be met with a global, cooperative effort to advance it in the many fields that could benefit humanity. In doing so, we should adopt a least-profit-first approach, pushing hard to advance Tool AI in a global non-profit framework. By taking a position of least profitable first, we can develop the skills necessary to work towards perfecting our approaches and processes while avoiding or delaying the profit motive. It is the profit motive that is highly corrupting. Governments around the world should place a 20-year moratorium on for-profit AI of any kind and require every AI venture to be set up as a non-profit, while tightly regulating the framework via hefty fines and criminal law. Will this completely resolve the profit problem? No. It never does. It would certainly go a long way, though. Riches can still be achieved via non-profits; it just helps to avoid the free-for-all and predatory behavior. The 20-year break from for-profit activity would force a slowdown and allow AI champions to focus on clean, well-thought-out and well-executed Tool AI, while reducing overall anxiety and fear. It would give Strong AI champions room to operate over time without the fear of being annihilated by the competition. The moratorium could also state that AI intellectual property rights can be issued for a 40-year period, giving IP holders 20 years of exclusivity once the 20-year profit moratorium ends. By that time we should already be on our way to a calmer, more focused and humanity-friendly Tool AI and Strong AGI genesis.
Max, thanks for a productive, thoughtful and important analysis. Yes, I couldn’t agree more! As we discussed at Bletchley a year ago, we are now planning to actually build “Tool AGI” as an example of safe-by-design AGI for sovereigns and industry… Check out our Tool AGI spec: 135 pages, 12 figures, 17 claims (intended as a NIST AISI Open Standard specification). Did you see my video yet? “They Make Me Sick” (1:25), dedicated to Max and Freddie (my boy, 20 months). All the best, PeterJ. at BiocommAI