Some Comments on Recent AI Safety Developments
Overview of New Developments
As of today, November 9th, 2024, three major AI developers have made clear their intent to begin or shortly begin offering tailored AI services to the United States Department of Defense and related personnel.
Anthropic: https://arstechnica.com/ai/2024/11/safe-ai-champ-anthropic-teams-up-with-defense-giant-palantir-in-new-deal/
Meta: https://scale.com/blog/defense-llama https://www.theregister.com/2024/11/06/meta_weaponizing_llama_us/
OpenAI: https://fortune.com/2024/10/17/openai-is-quietly-pitching-its-products-to-the-u-s-military-and-national-security-establishment/
These overtures are not directed solely by the AI companies themselves. There are also top-down pressures from the US government under Joe Biden to expand the military use of AI. Ivanka Trump has posted in favour of Leopold Aschenbrenner’s Situational Awareness document, suggesting that the incoming Trump administration will continue this trend.
For obvious reasons, I believe the same is happening in China, and possibly, to a lesser extent, in other nations with advanced AI labs (e.g. the UK). I focus on the American case because most of the leading AI labs are in fact American companies.
Analysis and Commentary
In some sense this is not new information. In January 2024, the paper
“Escalation Risks from Language Models in Military and Diplomatic
Decision-Making” warned that
“Governments are increasingly considering integrating autonomous AI
agents in high-stakes military and foreign-policy decision-making”. Also
in January, OpenAI removed language prohibiting the
use
of its products in military or warfare-related applications. In
September, after the release of o1, OpenAI also claimed that its tools have the capability to assist with CBRN (Chemical, Biological, Radiological, or Nuclear) weapons development. The evaluation was carried out with the assistance of experts who were familiar with the procedures necessary for “biological threat creation”, and who rated answers on both their factual correctness and their “ease of execution in wet lab”. From this
I infer that at least some of these experts have hands-on experience
with biological threats, and the most likely legal source of such
experts is Department of Defense personnel. Given these facts, to
demonstrate such capabilities and then deny the security community
access to the model seems both unviable and unwise for OpenAI.
Furthermore, it seems unlikely that the security community does not already have access to consumer-facing models, given the weak controls placed on the dissemination of Llama model weights and the ease of creating OpenAI or Anthropic accounts. Therefore, I proceed from the assumption that the offerings in these deals are tailor-made to the security community’s needs. This claim is explicitly corroborated in the case of Defense Llama, whose developers state that it was trained on “a vast dataset, including
military doctrine, international humanitarian law, and relevant policies
designed to align with the Department of Defense (DoD) guidelines for
armed conflict as well as the DoD’s Ethical Principles for Artificial
Intelligence”. The developers also claim it can answer operational
questions such as “how an adversary would plan an attack against a U.S.
military base”, suggesting access to confidential information beyond
basic doctrine. We also know that the PRC has developed similar tools based on open-weights models, making it even less likely that such offerings would be withheld from the US military.
Likely Future Developments
There have been calls for accelerated national security involvement in
AI, most notably from writers such as Leopold
Aschenbrenner (a former OpenAI
employee) and Dario
Amodei (CEO of
Anthropic). Amodei in particular favours an “entente” strategy in which a coalition of democratic nations races ahead in AI development to outpace autocracies and ensure a democratic
future. These sentiments echo the race to create the atomic bomb and
other similar technologies during World War II and the Cold War. I will
now explore the impact of such sentiments, which appear to be accepted
in security circles given the information above.
The first matter we must consider is what national security involvement
in AI will mean for the current AI development paradigm, which consists of open-weights, open-source, and closed-source developers organised as private companies (OpenAI, Anthropic), subsidiaries of existing companies (Google DeepMind, Meta FAIR), and non-profit research organisations (EleutherAI). Based on ideas of an AI race and looming
great power conflict, many have proposed a “Manhattan Project for AI”.
My interpretation of this plan is that all further AI development would
be centrally directed by the US government, with the Department of
Defense acting as the organiser and funder of such development. However,
I believe that such a Manhattan Project-style scenario is unlikely. Not
only is access to AI technology already widespread, but many contributions come from open-source or otherwise non-corporate, non-academic contributors. If acceleration is the goal, putting the
rabbit back in the hat would be counterproductive and wasteful.
Instead, I suggest that the likely model for AI development will be
similar to the development of cryptographic technology in the 20th
century. While it began as a military exercise, once civilian computing
became widespread and efforts to constrain knowledge of cryptanalysis
failed, a de facto dual-track system developed. Under this scheme, the
NSA (representing the government and security community) would protect
any techniques and
advancements
it developed to maintain a competitive edge in the geopolitical sphere.
Any release of technology for open
use would be
carefully arranged to maintain this strategic advantage, or only be
conducted after secrecy was
lost. At the same time,
developments from the open source, commercial, or academic community
would be permitted to continue. Any successes from the public or
commercial spheres would be incorporated, whether through hiring
promising personnel, licensing technology, or simply implementing their
own implementations now that feasibility was proven. A simple analogy is
that of a one way mirror in an interrogation room: those in the room
(i.e. the public, corporate researchers, academics) cannot look behind
the mirror, but those behind the mirror (the national security
community) can take advantage of any public developments and bring them
“behind the mirror” if necessary. A one-way filter is established
between the public sphere and the security sphere, an arrangement we see
mirrored in the ways AI technology is now being licensed, via
IL6-compliant AWS services (in the case of Anthropic) and secured
private servers (in the case of Defense Llama). The public-facing
companies are still active, but a veil of secrecy is established at the
interface with the national security community. To be clear: there is
no public indication known to me that proprietary model weights are
being shared with the security community. On the other hand, such
agreements would likely live behind the one-way mirror.
Having established that AI developers will likely continue to exist in
their current form, we must next ask how the development of AI systems
will be influenced by a growing rapprochement with the security
community. It is well known that existing AI alignment efforts are
post-hoc. That is to say, a base or “unaligned” system with both desired and undesired (dual-use) capabilities is created first, and is then “brought into compliance” via methods like RLHF and Constitutional AI. Therefore, the existence of commercial-facing “aligned” models implies the necessary existence of “base” models, which have not gone through this post-training modification process. Furthermore, RLHF and
Constitutional AI are both value-agnostic methods. They merely increase the likelihood of certain types of responses while suppressing others, with the desired type of response judged either by an oracle emulating human feedback or by a fully self-supervised model. This means
that it would be feasible to create “anti-aligned” models that feature
more undesirable responses. Such models might, for example, specialise
in developing novel cyberweapons or creating plausible propaganda. I would also expect the development of specialised scaffolding to exploit and enhance harmful capabilities. Amongst these enhanced capabilities are those most associated with profoundly harmful outcomes,
e.g. the ability to autonomously replicate and take over computer
systems, the ability to develop novel bioweapons, and the ability to
cripple infrastructure.
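To make the value-agnosticism point concrete, here is a minimal, purely illustrative sketch (the numbers are invented, and a DPO-style preference loss stands in for RLHF post-training). The only difference between the “aligned” and “anti-aligned” objectives is which completion in each pair is labelled as preferred; the optimisation machinery is identical.
```python
# Illustrative sketch only: a DPO-style preference loss is indifferent to which
# responses we call "chosen". Swapping the preference labels turns an alignment
# objective into an anti-alignment objective with no change to the machinery.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization objective: reward the policy for widening
    its margin on "chosen" responses relative to a frozen reference model."""
    margins = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margins).mean()

# Invented per-response log-probabilities for two (helpful, harmful) completion pairs.
policy_helpful = torch.tensor([-2.0, -1.5])
policy_harmful = torch.tensor([-2.5, -2.0])
ref_helpful    = torch.tensor([-2.1, -1.6])
ref_harmful    = torch.tensor([-2.4, -1.9])

# "Aligned" signal: helpful completions are labelled as preferred.
aligned = dpo_loss(policy_helpful, policy_harmful, ref_helpful, ref_harmful)

# "Anti-aligned" signal: identical call, preference labels swapped.
anti_aligned = dpo_loss(policy_harmful, policy_helpful, ref_harmful, ref_helpful)

print(f"aligned loss: {aligned.item():.4f}, anti-aligned loss: {anti_aligned.item():.4f}")
```
The same symmetry holds, in principle, for RLHF reward models (train the reward model on inverted preferences) and for Constitutional AI (invert the constitution).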
Why would these systems be created? While such capabilities would
normally be regarded as harmful per (for example) OpenAI’s risk
evaluation metrics, in a military context they are useful, even expected, behaviours for an AI system that is to be regarded as “helpful”. In particular,
under the great power conflict/arms race mindset, the possibility of
China or another enemy power developing these tools first would be
unthinkable. On the other hand, the ownership of such a tool would be a
powerful bargaining chip and demonstration of technological superiority.
Therefore, I believe that once the cordon of secrecy is in place, there
will be a desire and incentive for AI developers to produce such
anti-aligned systems tailored for military use. To return to the
cryptography metaphor, while easier methods to break cryptosystems or
intrude into secured networks are regarded as criminal and undesirable
outcomes in general society, for the NSA they are necessary operational
tools.
Some of you will protest that the standards of the US security community will prevent such systems from being created. This is a hypothesis that
can never be tested: Because of the one-way mirror system, we (actors in
the public sphere) will never know if such models are developed without
high-level whistleblowers or leaks. Furthermore, in a great power
conflict context the argument for anti-alignment is symmetric: any great
power with national security involvement in AI development will be aware
of these incentives. Perhaps other powers will not be so conscientious,
and perhaps American corporations will be happy to have two one-way
mirrors
installed: See https://en.wikipedia.org/wiki/Dragonfly_(search_engine).
What specifically has changed?
For many of you, these arguments will likely be familiar. However, the
specific agreements between US AI developers and the Department of
Defense, complete with the implementation of the one-way mirror, signal
the proper entry of the US into the AI militarisation race. Even if the
US military does not develop any anti-aligned capabilities beyond those
of commercially available models, other great powers will take notice of
this development and make their own estimations.
Furthermore, the one-way mirror effect is localised: Chinese
anti-alignment efforts cannot benefit from American public-sphere
developments as easily as American anti-alignment efforts can. This is
because of several factors including access to personnel, language and
cultural barriers, and protective measures like information security
protocols and limits to foreign
access.
Thus far, American AI development efforts are world leading (notice how the Chinese military uses Llama for its purposes rather than domestically developed LLMs!). This means that
American anti-alignment efforts, if they ramp up properly, will be world
leading as well.
Implications for AI safety work
Based on the above, I make several inferences for AI safety work, the
most important of which is this: Under the present AI development
paradigm, the capabilities of the best anti-aligned AI system will be
lower bounded by the capabilities of the best public aligned AI
system. Notice the asymmetry in knowledge we will possess about
anti-aligned and aligned systems: any advances behind the anti-aligned
systems need not be public, again because of the one-way mirror.
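Stated a little more formally (the notation here is mine, not drawn from any cited source): let $\mathcal{M}_{\text{pub}}$ be the set of publicly released aligned models, $\mathcal{M}_{\text{anti}}$ the set of anti-aligned systems behind the mirror, and $C(\cdot)$ any capability measure. Since any public model can in principle be re-post-trained into an anti-aligned counterpart of at least equal capability,
\[
\max_{m \in \mathcal{M}_{\text{anti}}} C(m) \;\ge\; \max_{m \in \mathcal{M}_{\text{pub}}} C(m).
\]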
The second inference is this: Any new public capabilities
innovations will be symmetrically applied. In other words, any attempt
to increase the capabilities of aligned models will be applied to
anti-aligned models so long as alignment is a post-training
value-agnostic process. Any method discovered behind the one-way mirror,
however, need not be shared with the public.
The third inference is this: Most new public alignment work will
also become anti-alignment work. For example, improvements to
RLHF/post-training alignment or new methods of model control can be
directly employed in service of anti-alignment. Similarly, developments
that make AI systems more reliable and effective are dual-use by their
nature. Mechanistic interpretability and other instrumental science work
will remain as it has always been, effectively neutral. Better
explanations of how models work can likely benefit both alignment and
anti-alignment because of the aforementioned symmetric nature of current
alignment methods.
The final inference is this: Public AI safety advocates are now
definitively not in control of AI development. While there have always
been AI developers who resisted AI safety concerns (e.g. Yann LeCun at
FAIR), and recently developers like OpenAI have signalled moves away
from an AI safety focus to a commercial AI focus, for a long time it
could be plausibly argued that most consequential AI development is
happening somewhat in the open with oversight and influence from figures
like Yoshua Bengio or Geoffrey Hinton who were concerned about AI
safety. It is now clear that AI safety advocates who do not have
security clearance will not even have full knowledge of cutting edge AI
developments, much less any say about their continued development or
deployment. The age of public AI safety advocates being invited to the
table is over.
There are exceptions to these inferences. For example, if a private AI
developer with no one-way mirror agreement works on their own to develop
a private, aligned model with superior capabilities, this would be a
triumph over anti-alignment. However, not only have most major AI developers rapidly acquiesced to the one-way mirror arrangement (with some notable exceptions), but any such development would likely inspire similar anti-alignment
efforts, especially due to the porous nature of the AI development and
AI safety communities. It is also possible that advocates in the
military will resist such developments due to a clear-eyed understanding
of the relevant risks, and instead push for the development of positive
alignment technologies with military backing. At the risk of repeating
myself, this is a fight we will not know the outcome of until it is
either inconsequential or too late.
What should we do?
To be short: I don’t know. Many people will argue that this is a
necessary step, that US anti-alignment efforts will be needed to counter
Chinese anti-alignment efforts, that giving autocracies leverage over
democracy in the form of weaponised AI is a death sentence for freedom.
Similar arguments have been deployed by the NSA in defense of its spying
and information-gathering efforts. However, others have pointed out the delusion of trying to control an AI born of race dynamics and malicious intent, and have pointed to historical overreaches and abuses of power by the American national security community. Indeed, the existence of high-profile leaks
suggests that escape or exfiltration of dangerous AI systems from behind
the one-way mirror is possible, making them a risk even if you have
faith in the standards of the US military. Perhaps now is also an opportune time to mention the incoming administration’s ties to neo-reactionary politics, as ably demonstrated in the Project 2025 transition plan.
One thing I think we should not do is continue with
business-as-usual under the blithe assumption that nothing has changed.
A clear line has been crossed. Even if the national security community
renounces all such agreements the day after this post goes live, steps
have been taken to reinforce the existing race dynamic and other nations
will notice. And as for the public? We are already staring at the
metaphorical one-way mirror, trying to figure out what is staring back
at us from behind it.