AI Governance researcher with Polaris Ventures, formerly of the MTAIR project, TFI and the Center on Long-term Risk. Graduate researcher at King’s and AI MSc at Edinburgh. Interested in philosophy, longtermism and AI Alignment.
Sammy Martin
Let me clarify an important point: The strategy preferences outlined in the paper are conditional statements—they describe what strategy is optimal given certainty about timeline and alignment difficulty scenarios. When we account for uncertainty and the asymmetric downside risks—where misalignment could be catastrophic—the calculation changes significantly. However, it’s not true that GM’s only downside is that it might delay the benefits of TAI.
Misalignment (or catastrophic misuse) has a much larger downside than a successful moratorium. That is true, but attempting a moratorium, losing your lead, and then having someone else develop catastrophically misaligned AI, when you could have developed a defense against it by adopting CD or SA, has just as large a downside.
And GM has a lower chance of being adopted than CD or SA, so the downside to pushing for a moratorium is not necessarily lower.
Since a half-successful moratorium is the worst of all worlds (assuming that alignment is feasible) because you lose out on your chances of developing defenses against unaligned or misused AGI, it’s not always true that the moratorium plan has fewer downsides than the others.
However, I agree with your core point: if we were to model this with full probability distributions over timeline and alignment difficulty, GM would likely be favored more heavily than our conditional analysis suggests, especially if we place significant probability on short timelines or hard alignment.
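To make the ‘integrate over uncertainty’ point concrete, here is a minimal expected-value sketch in Python. Every number in it (scenario probabilities, payoffs, adoption odds) is a placeholder I have invented purely for illustration, not a figure from the paper; the only point is the structure: GM’s lower adoption odds and the cost of a half-successful moratorium trade off against the asymmetric downside of misalignment.

```python
# Toy expected-value sketch. All probabilities and payoffs are invented
# placeholders for illustration, not numbers from the paper. The point is
# that strategy preferences conditional on a known scenario can shift once
# you integrate over uncertainty about alignment difficulty and over whether
# the strategy can actually be adopted.

scenarios = {            # hypothetical P(alignment difficulty)
    "easy": 0.3,
    "moderate": 0.4,
    "hard": 0.3,
}

# Hypothetical payoff of each strategy in each scenario (higher is better;
# large negative numbers stand in for catastrophe, e.g. misaligned takeover).
payoffs = {
    "CD": {"easy": 10, "moderate": 5, "hard": -50},
    "SA": {"easy": 12, "moderate": 3, "hard": -80},
    "GM": {"easy": 4, "moderate": 4, "hard": 2},
}

# Hypothetical chance each strategy is actually adopted, and the payoff if the
# attempt fails (a half-successful moratorium is modelled as handing the lead
# to a reckless actor, i.e. a large downside of its own).
p_adoption = {"CD": 0.7, "SA": 0.6, "GM": 0.3}
failure_payoff = {"CD": -10, "SA": -20, "GM": -60}

def expected_value(strategy: str) -> float:
    ev_if_adopted = sum(p * payoffs[strategy][s] for s, p in scenarios.items())
    p = p_adoption[strategy]
    return p * ev_if_adopted + (1 - p) * failure_payoff[strategy]

for strategy in payoffs:
    print(strategy, round(expected_value(strategy), 1))
```

With these placeholder numbers CD comes out ahead, but the ranking flips if, for example, you raise both the probability of hard alignment and GM’s adoption odds, which is exactly the trade-off described above.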
Analysis of Global AI Governance Strategies
The fact that an AI arms race would be extremely bad does not imply that rising global authoritarianism is not worth worrying about (and vice versa)
I am someone who is worried both about AI risks (from loss of control, and from war and misuse/structural risks) and about what seems to be a ‘new axis’ of authoritarian threats cooperating in unprecedented ways.
I won’t reiterate all the evidence here, but these two pieces and their linked sources should suffice:
Despite believing this thesis, I am not, on current evidence, in favor of aggressive efforts to “race and beat China” in AI, or of abandoning attempts to slow an AGI race. I think on balance it is still worth trying these kinds of cooperation, while being clear-eyed about the threats we face. I do think that there are possible worlds where, regretfully and despite the immense dangers, there is no other option but to race. I don’t think that we are in such a world as of yet.
However, I notice that many of the people who agree with me that an AI arms race would be very bad and that we should avoid it tend to downplay the risks of global authoritarianism or the differences between the West and its enemies, and very few seem to buy into the above thesis that a dangerous, interconnected web of authoritarian states with common interests is developing.
Similarly, most of the people who see the authoritarian threat which has emerged into clear sight over the last few years (from China, Russia, Iran, North Korea and similar actors) want to respond by racing, and think alignment will not be too difficult. This includes the leaders of many AI companies, who may have their own less patriotic reasons for pushing such an agenda.
I think this implicit correlation should be called out as a mistake.
As a matter of simple logic, how dangerous frantic AGI development is, and how hostile foreign adversaries are, are two unrelated variables which shouldn’t correlate.
In my mind, the following are all true:
1. An AI arms race would be extraordinarily dangerous, drastically raise the chance of nuclear war, and also probably raise the chance of loss of control of AGI leading to human extinction or of destructive misuse. It’s well worth trying hard to avoid AI arms races, even if our adversaries are genuinely dangerous and we won’t cooperate with them in general on other matters, and even if the prospects seem dim.
2. It is clearly much better that democratic societies have control of an AGI singleton than non-democratic countries like China, if those are the options. And, given current realities, there is a chance that an arms race is inevitable no matter how dangerous it is. If an arms race is inevitable, and transformative AI will do what we want, it is much better that the western democratic world is leading instead of authoritarian countries, especially if it is also developing AI under safer and more controlled conditions (which seems likely to me).
3. If alignment isn’t solvable or if the offense-defense balance is unfavorable, then it doesn’t matter who develops AGI, as it is a suicide race. But we don’t know whether that is the case as of yet.
I basically never see these three acknowledged all at once. We either see (1) and (3) grouped together, or (2) alone. I’m not sure what the best AI governance strategy to adopt is, but any analysis should start with a clear-eyed understanding of the international situation and of which values matter.
One underlying idea comes from how AI misalignment is expected to arise. If superintelligent AI systems are misaligned, does this misalignment look like an inaccurate generalization from what their overseers wanted, or like a ‘randomly rolled utility function’: a deceptively misaligned goal that’s entirely unrelated to anything their overseers intended to train? This is represented by levels 1-4 vs levels 5+ on my difficulty scale, more or less. If the misalignment is a result of economic pressures and a ‘race to the bottom’ dynamic, then it’s more likely to result in systems that care about human welfare alongside other things.
If the AI that’s misaligned ends up ‘egregiously’ misaligned and doesn’t care at all about anything valuable to us, as Eliezer thinks is most likely, then it places zero terminal value on human welfare, and only trade, threats or compromise would get it to be nice. If the AI is superintelligent and you aren’t, none of those considerations apply. Hence, nothing is left for humans.
If the AI is misaligned but doesn’t have an arbitrary value system, then it may value human survival at least a bit and do some equivalent of leaving a hole in the Dyson sphere.
For months, those who want no regulations of any kind placed upon themselves have hallucinated and fabricated information about the bill’s contents and intentionally created an internet echo chamber, in a deliberate campaign to create the impression of widespread opposition to SB 1047, and that SB 1047 would harm California’s AI industry.
There is another significant angle to add here. Namely: many of the people in this internet echo chamber, or behind this campaign, are part of the network of neoreactionaries, MAGA supporters, and tech elites who want to be unaccountable, the very network that you’ve positioned yourself as a substantial counterpoint to.
Obviously this invokes a culture-war fight, which has its downsides, but it’s not just rhetoric: the charge that many bill opponents are basing their decisions on an ideology that Newsom opposes and sees as dangerous for the country is true.
A16z and many of the other most dishonest opponents of the bill are part of a Trump-supporting network with close ties to neoreactionary thought, which opposes SB 1047 for precisely the same reason they want Trump and the Republicans to win: to remove restraints on their own power in the short-to-medium term, and more broadly because they see it as a step towards making our society one where wealthy oligarchs are given favorable treatment and can get away with anything.
It also serves as a counterpoint to the defense and competition angle, at least when it’s presented by a16z (this argument doesn’t work for e.g. OpenAI, but there are many other good counterarguments there). The claims they make about the bill harming competitiveness, e.g. for defense and security against China and other adversaries, ring hollow when most of them are anti-Ukraine-support or anti-NATO, making it clear they don’t generally care about the US maintaining its global leadership.
I think this would maybe be compelling to Newsom, who has positioned himself as an anti-MAGA figure.
I touched upon this idea indirectly in the original post when discussing alignment-related High Impact Tasks (HITs), but I didn’t explicitly connect it to the potential for reducing implementation costs and you’re right to point that out.
Let me clarify how the framework handles this aspect and elaborate on its implications.
Key points:
Alignment-related HITs, such as automating oversight or interpretability research, introduce challenges and make the HITs more complicated. We need to ask: what is the difficulty of aligning a system capable of automating the alignment of systems that are themselves capable of achieving HITs?
The HIT framing is flexible enough to accommodate the use of AI for accelerating alignment research, not just for directly reducing existential risk. If full automation of the alignment of systems capable of performing (non-alignment-related) HITs is itself construed as an HIT, the actual alignment difficulty corresponds to the level required to align the AI system performing the automation, not to the automated task itself.
In practice, a combination of AI systems at various alignment difficulty levels will likely be employed to reduce costs and risks for both alignment-related tasks and other applications. Partial automation and acceleration by AI systems can significantly impact the cost curve for implementing advanced alignment techniques, even if full automation is not possible.
The cost curve presented in the original post assumes no AI assistance, but in reality, AI involvement in alignment research could substantially alter its shape. This is because the cost curve covers the cost of performing research “to achieve the given HITs”, and since substantially automating alignment research is a possible HIT, by definition the cost graph is not supposed to include substantial assistance on alignment research.
However, that makes it unrealistic in practice, especially because (as indicated by the haziness on the graph) there will be many HITs, some accelerating alignment research and others incrementally reducing overall risk.
To illustrate, consider a scenario where scalable oversight at level 4 of the alignment difficulty scale is used to fully automate mechanistic interpretability at level 6, and this level 6 system can then go on to, say, research a method of impregnable cyber-defense, develop rapid counter-bioweapon vaccines, give superhuman geopolitical strategic advice, and unmask any unaligned AI present on the internet.
In this case, the actual difficulty level would be 4, with the HIT being the automation of the level 6 technique that’s then used to reduce risk substantially.
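To pin down the ‘actual difficulty level would be 4’ claim, here is a tiny sketch; it’s my own illustrative formalisation with hypothetical names, not notation from the post. The effective difficulty of a chain of automated alignment work is the difficulty of whichever system humans had to align directly.

```python
# Minimal sketch (illustrative formalisation, not notation from the post).
# If a system that humans align directly can automate the alignment of a more
# capable system, the difficulty humans actually pay is that of the first link
# in the chain, not that of the final system performing the HIT.

from dataclasses import dataclass
from typing import Optional

@dataclass
class System:
    name: str
    difficulty_level: int                   # level on the alignment difficulty scale
    aligned_by: Optional["System"] = None   # None => aligned directly by humans

def effective_difficulty(system: System) -> int:
    """Walk back along the automation chain to the system humans aligned directly."""
    while system.aligned_by is not None:
        system = system.aligned_by
    return system.difficulty_level

# The example from the text: level-4 scalable oversight automates level-6
# mechanistic interpretability, which then performs the risk-reducing HITs.
oversight = System("scalable oversight", difficulty_level=4)
mech_interp = System("automated mech interp", difficulty_level=6, aligned_by=oversight)

print(effective_difficulty(mech_interp))  # -> 4
```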
I do think that Apollo themselves were clear that this showed the model has the mental wherewithal for deception, and that if you apply absolutely no mitigations then deception happens. That’s what I said in my recent discussion of what this does and doesn’t show.
Therefore I described the 4o case as an engineered toy model of a failure at level 4-5 on my alignment difficulty scale (e.g. the dynamics of strategically faking performance on tests to pursue a large-scale goal), but it is not an example of such a failure.
In contrast, the AI scientist case was a genuine alignment failure, but a much simpler one: non-deceptive, non-strategic reward hacking from being given a sloppy goal by bad RLHF, just in a more sophisticated system than, say, CoinRun (level 2-3).
The hidden part that Zvi etc. skim over is the assumption that ‘of course’, in real life and in the near future, we’ll be in a situation where an o1-like model has instrumental incentives because it is pursuing an adversarial large-scale goal, and that the mitigations that could have been applied (better prompting, better RLHF, process oversight on the chain of thought, etc.) won’t work. But that is the entire contentious part of the argument!
One can make arguments that these oversight methods will break down e.g. when the system is generally superhuman at predicting what feedback its overseers will provide. However, those arguments were theoretical when they were made years ago and they’re still theoretical now.
This does count against naive views that assume alignment failures can’t possibly happen: there probably are people out there who believe that you have to give an AI system an “unreasonably malicious” rather than just “somewhat unrealistically single-minded” prompt to get it to engage in deceptive behavior, or who just irrationally think AIs will always know what we want and therefore can’t possibly be deceptive.
Good point. You’re right to highlight the importance of the offense-defense balance in determining the difficulty of high-impact tasks, rather than alignment difficulty alone. This is a crucial point that I’m planning to expand on in the next post in this sequence.
Many things determine the overall difficulty of HITs:
The “intrinsic” offense-defense balance in related fields (like biotechnology, weapons technologies and cybersecurity), and especially whether there are decisively offense-dominant technologies that transformative AI can develop and which can’t be countered
Overall alignment difficulty, affecting whether we should expect to see a large number of strategic, power-seeking unaligned systems or just systems engaging in more mundane reward hacking and sycophancy.
Technology diffusion rates, especially for anything offense-dominant, e.g. should we expect frontier models to leak or be deliberately open-sourced?
Geopolitical factors, e.g. are there adversary countries or large numbers of other well-resourced rogue actors to worry about, not just accidents, leaks, and random individuals?
The development strategy (e.g. whether the AI technologies are being proactively developed by a government or in public-private partnership or by companies who can’t or won’t use them protectively)
My rough suspicion is that all of these factors matter quite a bit, but since we’re looking at “the alignment problem” in this post I’m pretending that everything else is held fixed.
The intrinsic offense-defense balance of whatever is next on the ‘tech tree’, as you noted, is maybe the most important overall, as it affects the feasibility of defensive measures and could push towards more aggressive strategies in cases of strong offense advantage. It’s also extremely difficult to predict ahead of time.
How difficult is AI Alignment?
Yes, I do think constitution design is neglected! It’s possible people think constitution changes made now won’t stick around, or that they won’t make any difference in the long term, but based on the arguments here I think that, even if the effect is a bit diffuse, you can influence AI behavior on important structural risks by changing their constitutions. It’s simple, cheap and maybe quite effective, especially for failure modes that we don’t have any good shovel-ready technical interventions for.
What is moral realism doing in the same taxon with fully robust and good-enough alignment? (This seems like a huge, foundational worldview gap; people who think alignment is easy still buy the orthogonality thesis.)
Technically, even Moral Realism doesn’t imply the Anti-Orthogonality thesis! Moral Realism is necessary but not sufficient for Anti-Orthogonality: to be Anti-Orthogonality you have to be a particular kind of very hardcore Platonist moral realist who believes that ‘to know the good is to do the good’, and argue that not only are there moral facts but that these facts are intrinsically motivating.
Most moral realists would say that it’s possible to know what’s good but not act on it: even if this is an ‘unreasonable’ disposition in some sense, this ‘unreasonableness’ is compatible with being extremely intelligent and powerful in practical terms.
Even famous moral realists like Kant wouldn’t deny the Orthogonality thesis: Kant would accept that it’s possible to understand hypothetical but not categorical imperatives, and he’d distinguish capital-R Reason from simple means-end ‘rationality’. Among moral realists, I think it’s really only Platonists and divine command theorists who’d deny Orthogonality itself.
I don’t believe that you can pass an Ideological Turing Test for people who’ve thought seriously about these issues and assign a decent probability to things going well in the long term, e.g. Paul Christiano, Carl Shulman, Holden Karnofsky and a few others.
The futures described by the likes of Carl Shulman, which I find relatively plausible, don’t fit neatly into your categories, but seem to be some combination of (3) (though you lump ‘we can use pseudo-aligned AI that does what we want it to do in the short term on well-specified problems to navigate competitive pressures’ in with ‘AI systems will want to do what’s morally best by default so we don’t need alignment’) and (10) (which I would phrase as ‘unaligned AIs which aren’t deceptively inner misaligned will place some value on human happiness and placating their overseers’), along with (4) and (6) (specifically that either an actual singleton or a coalition with strong verifiable agreements and regulation can emerge, or that coordination can be made easier by AI advice).
Both 3 and 10 I think are plausible when phrased correctly: that pseudo-aligned powerful AI can help governments and corporations navigate competitive pressures, and that AI systems (assuming they don’t have some strongly deceptively misaligned goal from a sharp left turn) will still want to do things that satisfy their overseers among the other things they might want, such that there won’t be strong competitive pressures to kill everyone and ignore everything their human overseers want.
In general, I’m extremely wary of arguing by ‘default’ in either direction. Both of these are weak arguments for similar reasons:
The default outcome of “humans all have very capable AIs that do what they want, and the humans are otherwise free”, is that humans gain cheap access to near-infinite cognitive labor for whatever they want to do, and shortly after a huge increase in resources and power over the world. This results in a super abundant utopian future.
The default outcome of “humans all have very capable AIs that do what they want, and the humans are otherwise free”, is that ‘the humans all turn everything over to their AIs and set them loose to compete’ because anyone not doing that loses out, on an individual level and on a corporate or national level, resulting in human extinction or dystopia via ‘race to the bottom’
My disagreement is that I don’t think there’s a strong default either way. At best there is a (highly unclear) default towards futures that involve a race to the bottom, but that’s it.
I’m going to be very nitpicky here but with good cause. “To compete”—for what, money, resources, social approval? “The humans”—the original developers of advanced AI, which might be corporations, corporations in close public-private partnership, or big closed government projects?
I need to know what this actually looks like in practice to assess the model, because the world is not this toy model, and the dis-analogies aren’t just irrelevant details.
What happens when we get into those details?
We can try to make this specific: envision a concrete scenario involving a fairly quick full automation of everything, think through a specific bad outcome (in a world where governments and corporate decision-makers have pseudo-aligned AI which will, e.g., answer questions superhumanly well in the short term), and then also imagine ways it could fail to happen. I’ve done this before, and you can do it endlessly:
Here’s one of the production web stories in brief, but you can read it in full, along with my old discussion, here:
In the future, AI-driven management assistant software revolutionizes industries by automating decision-making processes, including “soft skills” like conflict resolution. This leads to massive job automation, even at high management levels. Companies that don’t adopt this technology fall behind. An interconnected “production web” of companies emerges, operating with minimal human intervention and focusing on maximizing production. They develop a self-sustaining economy, using digital currencies and operating beyond human regulatory reach. Over time, these companies, driven by their AI-optimized objectives, inadvertently prioritize their production goals over human welfare. This misalignment leads to the depletion of essential resources like arable land and drinking water, ultimately threatening human survival, as humanity becomes unable to influence or stop these autonomous corporate entities.
My object-level response is to say something mundane along the lines of, I think each of the following is more or less independent and not extremely unlikely to occur (each is above 1% likely):
Wouldn’t governments and regulators also have access to AI systems to aid with oversight and especially with predicting the future? Remember, in this world we have pseudo-aligned AI systems that will more or less do what their overseers want in the short term.
Couldn’t a political candidate ask their (aligned) strategist-AI ‘are we all going to be killed by this process in 20 years?’ and then make a persuasive campaign to change the public’s mind early in the process, using obvious evidence to their advantage?
If the world is alarmed by the expanding production web and governments have a lot of hard power initially, why will enforcement necessarily be ineffective? If there’s a shadow economy of digital payments, just arrest anyone found dealing with a rogue AI system. This would scare a lot of people.
We’ve already seen pessimistic views about what AI regulations can achieve self-confessedly falsified at the 98% level; there’s sleepwalk bias to consider. As Stefan Schubert put it: “Yeah, if people think the policy response is ‘99th-percentile-in-2018’, then that suggests their models have been seriously wrong.” So maybe the regulations will be effective, foresightful and well implemented, with AI systems foreseeing the long-run consequences of decisions and backing them up.
What if the lead project is unitary and a singleton or the few lead projects quickly band together because they’re foresightful, so none of this race to the bottom stuff happens in the first place?
If it gets to the point where water or the oxygen in the atmosphere is being used up (why would that happen again? why wouldn’t it just be easier for the machines to fly off into space rather than deal with the presumed disvalue of doing something their original overseers didn’t like?), did nobody build in ‘off switches’?
Even if they aren’t fulfilling our values perfectly, wouldn’t the production web just reach some equilibrium where it’s skimming off a small amount of resources to placate its overseers (since its various components are at least somewhat beholden to them) while expanding further and further?
And I already know the response is just going to be “Moloch wouldn’t let that happen...” and that eventually competition will mean that all of these barriers disappear. At this point, though, I think that such a response is too broad and proves too much. If you use the Moloch idea this way, it becomes the classic mistaken “one big idea universal theory of history” which can explain nearly any outcome so long as it doesn’t have to predict it.
A further point: I think that someone using this kind of reasoning in 1830 would have very confidently predicted that the world of 2023 would be a horrible dystopia where wages for workers hadn’t improved at all, because of Moloch.
You can call this another example of (11), i.e. assuming the default outcome will be fine and then arguing against a specific bad scenario, so it doesn’t affect the default; but that assumes what you’re trying to establish.
I’m arguing that when you get into practical details of any scenario (assuming pseudo-aligned AI and no sharp left turn or sudden emergence), you can think of ways to utilize the vast new cognitive labor force available to humanity to preempt or deal with the potential competitive race to the bottom, the offense-defense imbalance, or other challenges, which messes up the neat toy model of competitive pressures wrecking everything.
When you try to translate the toy model of:
“everyone” gets an AI that’s pseudo aligned --> “everyone” gives up control and lets the AI “compete” --> “all human value is sacrificed” which presumably means we run out of resources on the earth or are killed
into a real world scenario by adding details on about who develops the AI systems, what they want, and specific ways the AI systems could be used, we also get practical, real world counterarguments as to why it might not happen. Things still get dicey the faster takeoff is and the harder alignment is, as we don’t have this potential assistance to rely on and have to do everything ourselves, but we already knew that, and you correctly point out that nobody knows the real truth about alignment difficulty anyway.
To be clear, I still think that some level of competitive degradation is a default: there will be strong competitive pressure to delegate more and more decision-making to AI systems and take humans out of the loop. But the claim that this proceeds unabated towards systems that are pressured into not caring about their overseers at all, resulting in a world of ‘perfect competition’ and a race to the bottom that continues unimpeded until humans are killed, is a much weaker default than you describe in practice.
Treating a molochian race to the bottom as an overwhelmingly strong default ignores the complexity of real-world systems, the potential for adaptation and intervention, and the historical track record of similar predictions about complicated social systems.
Similarly, his ideas of things like ‘a truth seeking AI would keep us around’ seem to me like Elon grasping at straws and thinking poorly, but he’s trying.
The way I think about Elon is that he’s very intelligent but essentially not open to any new ideas or capable of self-reflection if his ideas are wrong, except on technical matters: if he can’t clearly follow the logic himself on the first try, or if there’s a reason it would be uncomfortable or difficult for him to accept initially, then he won’t believe you; but he is smart.
Essentially, he got one good idea about AI risk into his head 10+ years ago and therefore says locally good things and isn’t simply lying when he says them, but it doesn’t hang together in his head in a consistent way (e.g. if he thought international stability and having good AI regulation was a good idea he wouldn’t be supporting the candidate that wants to rethink all US alliances and would impair the federal government’s ability to do anything new and complicated, with an e/acc as his running mate). In general, I think one of the biggest mental blind spots EA/rationalist types have is overestimating the coherence of people’s plans for the future.
The Economist is opposed, in a quite bad editorial calling belief in the possibility of a catastrophic harm ‘quasi-religious’ without argument, and uses that to dismiss the bill, instead calling for regulations that address mundane harms. That’s actually it.
I find this especially strange, almost to the point that I’m willing to call it knowing bad faith. The Economist has in the past sympathetically interviewed Helen Toner, done deep-dive investigations into mechanistic interpretability research at a higher level of analysis than I’ve seen from any other mainstream news publication, and run articles acknowledging the soft consensus among tech workers and AI experts on the dangers, including survey results, so it’s doubly difficult to dismiss the risk as “too speculative” or “scifi”.
To state without elaboration that the risk is “quasi-religious” or “science fictional” when their own journalists have consistently said the opposite and provided strong evidence that the AI world generally agrees makes me feel like someone higher up changed their mind for some reason regardless of what their own authors think.
The one more concrete reference they gave was to the very near-term (as in, the next year) prospect of AI systems being used to assist in terrorism, which has indeed been slightly exaggerated by some; but to claim that there’s no idea whatsoever of where these capabilities could be in three years is absurd, given what they themselves have said in previous articles.
Without some explanation as to why they think genuine catastrophic misuse concerns are not relatively near term and relatively serious (e.g. explaining why they think we won’t see autonomous agents that could play a more active role in terrorism if freely available) it just becomes the classic “if 2025 is real why isn’t it 2025 now” fallacy.
The short argument I’ve been using is:
If you want to oppose the bill, then as a matter of logical necessity you have to believe some combination of:
1. No significant near-term catastrophic AI risks exist that warrant this level of regulation.
2. Significant near-term risks exist, but companies shouldn’t be held liable for them (i.e. you’re an extremist ancap).
3. Better alternatives are available to address these risks.
4. The bill will be ineffective or counterproductive in addressing these risks.
The best we get is vague hints at (1) from some tech leaders (but e.g. Google and OpenAI definitely don’t believe (1)); vague pie-in-the-sky appeals to (3), as if the federal government were working efficiently on frontier tech issues right now; or claims for (4) that either lie about the content of the bill (e.g. claiming it applies to small startups and academics) or fearmonger (e.g. that every tech company in California will up sticks and leave, or that it will so impair progress that China will inevitably win), so that the bill will not achieve its stated aims.
AI Constitutions are a tool to reduce societal scale risk
This seems like really valuable work! And while situational awareness isn’t a sufficient condition for being able to fully automate many intellectual tasks, it seems like a necessary condition at least, so this is already a much better benchmark for ‘intelligence’ than e.g. MMLU.
I agree that this is a real possibility and in the table I did say at level 2,
Misspecified rewards / ‘outer misalignment’ / structural failures where systems don’t learn adversarial policies [2] but do learn to pursue overly crude and clearly underspecified versions of what we want, e.g. the production web or WFLL1.
From my perspective, it is entirely possible to have an alignment failure that works like this and occurs at difficulty level 2. This is still an ‘easier’ world than the higher levels, because in those higher-difficulty worlds you can get killed in a much swifter and earlier way, with far less warning.
The reason I wouldn’t put it at level 8 is that presumably the models are following a reasonable proxy for what we want, one which generalizes well beyond human level but is inadequate in ways that become apparent later on. Level 8 says not that some misgeneralization occurs, but that rapid, unpredictable misgeneralization occurs around the human level, such that alignment techniques quickly break down.
In the scenario you describe, there’d be an opportunity to notice what’s going on (after all you’d have superhuman AI that more or less does what it’s told to help you predict future consequences of even more superhuman AI) and the failure occurs much later.
“OpenAI appears to subscribe to that philosophy [of ‘bothsidesism’]. Also there seems to be a ‘popular opinion determines attention and truth’ thing here?”
OpenAI’s approach is well-intentioned but crude, and might be counterproductive. The goal they should be aiming at is something best construed as “have good moral and political epistemology”, something people are notoriously bad at by default.
Being vaguely bothsidesist is a solution you see a lot in human institutions that don’t want to look biased, so it’s not an unusually bad solution by any means, but it’s not good enough for high-stakes situations.
What should the goal be? Instead of just presenting “both sides”, I think we should focus on making the AI acutely aware of the distinction between facts and values, and on explicitly flagging values conflicts when they arise. Making sure the model explicitly identifies and separates empirical claims from value judgments means we can achieve better epistemics without resorting to false equivalences. Maybe for sufficiently unambiguous values that everyone shares we don’t want to do this, but I think you should make the model biased towards saying “if X is what you value, then do Y” whenever possible.
“This is weird. Why should the model need to spend tokens affirming that the user can believe what they wish? If information changes someone’s mind, that is a feature.”
Once again, I think what they’re getting at is in principle good. I’d phrase it as: the model should be biased towards being decision-support-oriented, not persuasive. The strategy of writing persuasive content and then tacking on “but believe what you want!” is indeed a cheap hack that doesn’t solve the underlying issue. It would probably be better for the model to explicitly say when it’s being persuasive and when it’s not, err on the side of not persuading whenever possible, and always be “meta-honest” and upfront about what it thinks. That way we can at least be more assured it’s just being used for decision assistance when that’s all we want.
If you go with an assumption of good faith then the partial, gappy RSPs we’ve seen are still a major step towards having a functional internal policy to not develop dangerous AI systems because you’ll assume the gaps will be filled in due course. However, if we don’t assume a good faith commitment to implement a functional version of what’s suggested in a preliminary RSP without some kind of external pressure, then they might not be worth much more than the paper they’re printed on.
But, even if the RSPs aren’t drafted in good faith and the companies don’t have a strong safety culture (which seems to be true of OpenAI judging by what Jan Leike said), you can still have the RSP commitment rule be a foundation for actually effective policies down the line.
For comparison, if a lot of dodgy water companies sign on to a ‘voluntary compact’ to develop some sort of plan to assess the risk of sewage spills then probably the risk is reduced by a bit, but it also makes it easier to develop better requirements later, for example by saying “Our new requirement is the same as last years but now you must publish your risk assessment results openly” and daring them to back out. You can encourage them to compete on PR by making their commitments more comprehensive than their opponents and create a virtuous cycle, and it probably just draws more attention to the plans than there was before.
Maybe we have different definitions of DSA: I was thinking of it in terms of ‘resistance is futile’ and you can dictate whatever terms you want because you have overwhelming advantage, not that you could eventually after a struggle win a difficult war by forcing your opponent to surrender and accept unfavorable terms.
If, say, the US of 1965 were dumped into the post-WW2 Earth, it would have the ability to dictate whatever terms it wanted, because it would be able to launch hundreds of ICBMs at enemy cities at will. If the real US of 1949 had started a war against the Soviets, it would probably have been able to cripple an advance into Western Europe, but it likely wouldn’t have been able to get its bombers through to devastate enough of the Soviet homeland with the few bombs it had.
Remember, the Soviets had just lost a huge percentage of their population and industry in WW2 and kept fighting. The fact that it’s at all debatable who would have won if WW3 had started in the late 1940s (see e.g. here) makes me think nuclear weapons weren’t, at that time, a DSA producer.
We do discuss this in the article and tried to convey that it is a very significant downside of SA. All 3 plans have enormous downsides though, so a plan posing massive risks is not disqualifying. The key is understanding when these risks might be worth taking given the alternatives.
CD might be too weak if TAI is offense-dominant, regardless of regulations or cooperative partnerships, and might result in misuse or misalignment catastrophe.
If GM fails it might blow any chance of producing protective TAI and hand over the lead to the most reckless actors.
SA might directly provoke a world war or produce unaligned AGI ahead of schedule.
SA is favored when alignment is easy or moderately difficult (e.g. at the level where interpretability probes, scalable oversight etc. help) with high probability, and when you expect to win the arms race. But it doesn’t require you to be the ‘best’. The key isn’t whether US control is better than Chinese control, but whether centralized development under any actor is preferable to widespread proliferation of TAI capabilities to potentially malicious actors.
Regarding whether the US (remember, on SA there’s assumed to be extensive government oversight) is better than the CCP: I think the answer is yes, and I talk a bit more about why here. I don’t consider US AI control being better than Chinese AI control to be the most important argument in favor of SA, however. That fact alone doesn’t remotely justify SA: you also need easy/moderate alignment, and you need good evidence that an arms race is likely unavoidable regardless of what we recommend.