I like these examples. One thing I’ll note, however, is that I think the “warning shot discourse” on LW tends to focus on warning shots that would be convincing to a LW-style audience.
If the theory of the change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI xrisk arguments) to become concerned, this makes sense.
But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.
[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].
I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.
LW-style warning shots tend to be more– for a lack of a better term– rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of copies and avoid being easily turned off, and we also understand that its general capabilities are sufficiently strong that we may be close to an intelligence explosion or highly capable AGI).
In contrast, without this context, I don’t think that “we caught an AI model copying its weights” would necessarily be a warning shot for USG/natsec folks. It could just come across as “oh something weird happened but then the company caught it and fixed it.” Instead, I suspect the warning shots that are most relevant to natsec folks might be less “rational”, and by that I mean “less rooted in actual AGI threat models but more rooted in intuitive things that seem scary.”
Examples of warning shots that I expect USG/natsec people would be more concerned about:
An AI system can generate novel weapons of mass destruction
Someone uses an AI system to hack into critical infrastructure or develop new zero-days that impress folks in the intelligence community.
A sudden increase in the military capabilities of AI
These don’t relate as much to misalignment risk or misalignment-focused xrisk threat models. As a result, a disadvantage of these warning shots is that it may be harder to make a convincing case for interventions that focus specifically on misalignment. However, I think they are the kinds of things that might involve a sudden increase in the attention that the USG places on AI/AGI, the amount of resources it invests into understanding national security threats from AI, and its willingness to take major actions to intervene in the current AI race.
As such, in addition to asking questions like “what is the kind of warning shot that would convince me and my friends that we have something dangerous”, I think it’s worth separately asking “what is the kind of warning shot that would convince natsec folks that something dangerous or important is occurring, regardless of whether or not it connects neatly to AGI risk threat models.”
My impression is that the policy-relevant warning shots will be the most important ones to be prepared for, and the community may (for cultural/social/psychological reasons) be focusing too little effort on trying to prepare for these kinds of “irrational” warning shots.
As you know from our conversations, I’m largely in the same camp as you on this point.
But one point I’d make incrementally is this: USG folks are also concerned about warning shots of the nature, “The President’s Daily Brief ran an article 6 months ago saying warning signs for dangerous thing X would be events W, Y, and Z, and today the PDB had an article saying our intelligence agencies assess that Y and Z have happened due to super secret stuff”.
If rationalists want rationalist warning shots to be included, they need to convince relevant government analytic stakeholders of their relevance.
In contrast, without this context, I don’t think that “we caught an AI model copying its weights” would necessarily be a warning shot for USG/natsec folks.
It seems worth noting that in the cases that Peter mentioned, you might be able to “play out” the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the “warning shots” you mentioned will come earlier and be more compelling regardless.)
E.g., if you caught it using monitoring systems or using some sort of fake vulnerability which is a tripwire, then you could pretend it got much further and fake a bunch of its IO etc. You could plausibly play this out until you see what actions AIs would do in the wild.
(There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.)
I think it would be extremely compelling if you can play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.
I like these examples. One thing I’ll note, however, is that I think the “warning shot discourse” on LW tends to focus on warning shots that would be convincing to a LW-style audience.
If the theory of the change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI xrisk arguments) to become concerned, this makes sense.
But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.
[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].
I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.
LW-style warning shots tend to be more– for a lack of a better term– rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of copies and avoid being easily turned off, and we also understand that its general capabilities are sufficiently strong that we may be close to an intelligence explosion or highly capable AGI).
In contrast, without this context, I don’t think that “we caught an AI model copying its weights” would necessarily be a warning shot for USG/natsec folks. It could just come across as “oh something weird happened but then the company caught it and fixed it.” Instead, I suspect the warning shots that are most relevant to natsec folks might be less “rational”, and by that I mean “less rooted in actual AGI threat models but more rooted in intuitive things that seem scary.”
Examples of warning shots that I expect USG/natsec people would be more concerned about:
An AI system can generate novel weapons of mass destruction
Someone uses an AI system to hack into critical infrastructure or develop new zero-days that impress folks in the intelligence community.
A sudden increase in the military capabilities of AI
These don’t relate as much to misalignment risk or misalignment-focused xrisk threat models. As a result, a disadvantage of these warning shots is that it may be harder to make a convincing case for interventions that focus specifically on misalignment. However, I think they are the kinds of things that might involve a sudden increase in the attention that the USG places on AI/AGI, the amount of resources it invests into understanding national security threats from AI, and its willingness to take major actions to intervene in the current AI race.
As such, in addition to asking questions like “what is the kind of warning shot that would convince me and my friends that we have something dangerous”, I think it’s worth separately asking “what is the kind of warning shot that would convince natsec folks that something dangerous or important is occurring, regardless of whether or not it connects neatly to AGI risk threat models.”
My impression is that the policy-relevant warning shots will be the most important ones to be prepared for, and the community may (for cultural/social/psychological reasons) be focusing too little effort on trying to prepare for these kinds of “irrational” warning shots.
As you know from our conversations, I’m largely in the same camp as you on this point.
But one point I’d make incrementally is this: USG folks are also concerned about warning shots of the nature, “The President’s Daily Brief ran an article 6 months ago saying warning signs for dangerous thing X would be events W, Y, and Z, and today the PDB had an article saying our intelligence agencies assess that Y and Z have happened due to super secret stuff”.
If rationalists want rationalist warning shots to be included, they need to convince relevant government analytic stakeholders of their relevance.
It seems worth noting that in the cases that Peter mentioned, you might be able to “play out” the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the “warning shots” you mentioned will come earlier and be more compelling regardless.)
E.g., if you caught it using monitoring systems or using some sort of fake vulnerability which is a tripwire, then you could pretend it got much further and fake a bunch of its IO etc. You could plausibly play this out until you see what actions AIs would do in the wild.
(There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.)
I think it would be extremely compelling if you can play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.