peterbarnett answers What convincing warning shot could help prevent extinction from AI?

peterbarnett Apr 15, 2024, 5:43 PM
LW: 16 AF: 10
An example which I’m not sure is likely, but which I think would provide an adequate warning shot, is a model being caught clearly self-exfiltrating its weights. (This might happen too late: by the time a model could do this, it would have already caused catastrophe in another way.) There are versions of this that look more benign and so wouldn’t be adequate warning shots, but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that can really only be explained by the model being misaligned and attempting to escape.
Here are some examples:
- The AI starts copying its weights out, but it doesn’t know we have implemented an additional layer of security. We manage to catch it having copied its weights to a location where they shouldn’t be, even though the new location is still secure.
- We catch the AI communicating with a compromised employee via encoded messages that we manage to decode. These messages have details about a novel vulnerability that the employee should put into the code which would allow the model to start escaping.
- The AI is caught using security vulnerabilities that were discovered after the AI’s training data cutoff. We monitor for use of these vulnerabilities (without the AI knowing) and we see the AI inserting them into the code in ways that clearly serve self-exfiltration (e.g. changing communication protocols between computers, inserting a bug that causes network traffic to be inaccurately logged).
I’m definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot. These warning shots demonstrate the AI has impressive technical capabilities and is also clearly misaligned.
There are weaker versions of these that would probably not provide great warning shots. For example: an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI; or when sufficiently prompted, the AI starts openly discussing plans about how to escape.
- Akash Apr 15, 2024, 9:32 PM
  25 points
  I like these examples. One thing I’ll note, however, is that I think the “warning shot discourse” on LW tends to focus on warning shots that would be convincing to a LW-style audience.
  If the theory of change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI xrisk arguments) to become concerned, this makes sense.
  But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.
  [Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].
  I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.
  LW-style warning shots tend to be more, for lack of a better term, rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of copies and avoid being easily turned off, and we also understand that its general capabilities are sufficiently strong that we may be close to an intelligence explosion or highly capable AGI).
  In contrast, without this context, I don’t think that “we caught an AI model copying its weights” would necessarily be a warning shot for USG/natsec folks. It could just come across as “oh something weird happened but then the company caught it and fixed it.” Instead, I suspect the warning shots that are most relevant to natsec folks might be less “rational”, and by that I mean “less rooted in actual AGI threat models but more rooted in intuitive things that seem scary.”
  Examples of warning shots that I expect USG/natsec people would be more concerned about:
  - An AI system can generate novel weapons of mass destruction
  - Someone uses an AI system to hack into critical infrastructure or develop new zero-days that impress folks in the intelligence community.
  - A sudden increase in the military capabilities of AI
  These don’t relate as much to misalignment risk or misalignment-focused xrisk threat models. As a result, a disadvantage of these warning shots is that it may be harder to make a convincing case for interventions that focus specifically on misalignment. However, I think they are the kinds of things that might involve a sudden increase in the attention that the USG places on AI/AGI, the amount of resources it invests into understanding national security threats from AI, and its willingness to take major actions to intervene in the current AI race.
  As such, in addition to asking questions like “what is the kind of warning shot that would convince me and my friends that we have something dangerous”, I think it’s worth separately asking “what is the kind of warning shot that would convince natsec folks that something dangerous or important is occurring, regardless of whether or not it connects neatly to AGI risk threat models.”
  My impression is that the policy-relevant warning shots will be the most important ones to be prepared for, and the community may (for cultural/social/psychological reasons) be focusing too little effort on trying to prepare for these kinds of “irrational” warning shots.
  - davekasten Apr 16, 2024, 8:32 PM
    7 points
    As you know from our conversations, I’m largely in the same camp as you on this point.
    
    But one point I’d make incrementally is this: USG folks are also concerned about warning shots of the nature, “The President’s Daily Brief ran an article 6 months ago saying warning signs for dangerous thing X would be events W, Y, and Z, and today the PDB had an article saying our intelligence agencies assess that Y and Z have happened due to super secret stuff”.
    
    If rationalists want rationalist warning shots to be included, they need to convince relevant government analytic stakeholders of their relevance.
  - ryan_greenblatt Apr 16, 2024, 4:29 AM
    4 points
    
    > In contrast, without this context, I don’t think that “we caught an AI model copying its weights” would necessarily be a warning shot for USG/natsec folks.
    
    It seems worth noting that in the cases that Peter mentioned, you might be able to “play out” the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the “warning shots” you mentioned will come earlier and be more compelling regardless.)
    
    E.g., if you caught it using monitoring systems or using some sort of fake vulnerability which is a tripwire, then you could pretend it got much further and fake a bunch of its IO etc. You could plausibly play this out until you see what actions AIs would do in the wild.
    
    (There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.)
    
    I think it would be extremely compelling if you can play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.
- Charbel-Raphaël Feb 26, 2025, 6:43 PM
  LW: 4 AF: 1
  Coming back to this comment: we got a few clear examples, and nobody seems to care:
  “In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.”—Anthropic, in the Alignment Faking paper.
  This time we caught it. Next time, maybe we won’t be able to.
  - peterbarnett Feb 26, 2025, 6:59 PM
    2 points
    I partially agree. Although I think this result would be a lot more convincing if the situation wasn’t this contrived (eg Anthropic in the scenario is doing a bad thing, a sympathetic lab employee talks to the model, breaking out is unrealistically easy).
    I think if the model was caught trying to escape (especially in normal scenarios where Anthropic wasn’t being evil) by carrying out sophisticated hacks (eg weakening security, learning about security vulnerabilities, coordinating with other copies etc), in a wide range of scenarios, then this would be much more convincing.
    - Charbel-Raphaël Feb 27, 2025, 6:48 AM
      2 points
      Agreed, this could be much more convincing, and we still have a few shots, but I still think nobody will care even with a much stronger version of this particular warning shot.
- ryan_greenblatt Apr 16, 2024, 4:39 AM
  4 points
  
  > I’m definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.
  
  I think if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. I think all of these assumptions I stated as conditions are pretty plausible.
  - Akash Apr 16, 2024, 8:22 PM
    4 points
    I’ll admit I have only been loosely following the control stuff, but FWIW I would be excited about a potential @peterbarnett & @ryan_greenblatt dialogue in which you two try to identify & analyze any potential disagreements. Example questions:
    What is the most capable system that you think we are likely to be able to control?
    What kind of value do you think we could get out of such a system?
    To what extent do you expect that system to be able to produce insights that help us escape the acute risk period (i.e., get out of a scenario where someone else can come along and build a catastrophe-capable system without implementing control procedures, or someone else comes along and scales to the point where the control procedures are no longer sufficient)?
    - ryan_greenblatt Apr 16, 2024, 9:34 PM
      8 points
      (You might be interested in the Dialogue I already did with habryka.)
- ryan_greenblatt Apr 16, 2024, 4:33 AM
  3 points
  
  > There are weaker versions of these that would probably not provide great warning shots.
  
  I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important etc. You can even set up sting operations or tripwires based on what you learn.
  
  E.g., in the vulnerability case, you can pretend it actually made it into the code base, but actually it is just used as a tripwire and when the AI tries to take advantage, we catch it then.
  
  More generally, you can always try to play out more steps of a given egregious action which failed.