In contrast, without this context, I don’t think that “we caught an AI model copying its weights” would necessarily be a warning shot for USG/natsec folks.
It seems worth noting that in the cases that Peter mentioned, you might be able to “play out” the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the “warning shots” you mentioned will come earlier and be more compelling regardless.)
E.g., if you caught it with monitoring systems or with some sort of fake vulnerability that acts as a tripwire, then you could pretend it got much further and fake a bunch of its I/O, etc. You could plausibly play this out far enough to see what actions the AI would take in the wild.
(There are some difficulties here because realistic inputs can be hard to fake, but I bet you can play out the exfiltration attempt very far in practice.)
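As a very rough illustration of what "playing out" an attempt could look like mechanically, here is a minimal sketch (all names here, like `ToolCall`, `looks_like_exfiltration`, and `PlayedOutEnvironment`, are hypothetical, not any existing API): once a tripwire fires, the agent's tool calls get silently routed to a simulator that fabricates plausible outputs, and everything after that point gets recorded.

```python
# Hypothetical sketch of "playing out" a caught exfiltration attempt:
# run the agent against the real environment until a tripwire fires,
# then silently swap in a simulator and log what it tries next.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str   # e.g. "scp", "http_post", "read_file"
    args: dict


def looks_like_exfiltration(call: ToolCall) -> bool:
    # Toy tripwire: flag attempts to copy model weights to an external host.
    return call.name in {"scp", "http_post"} and "weights" in str(call.args)


class FakeSuccessEnv:
    """Toy simulator: pretend every action succeeds so we can watch the next steps."""
    def execute(self, call: ToolCall):
        return {"status": "ok", "detail": f"(simulated) {call.name} succeeded"}


class PlayedOutEnvironment:
    """Routes tool calls to the real environment until the tripwire fires,
    then routes everything to the simulator and records the transcript."""
    def __init__(self, real_env, simulated_env):
        self.real_env = real_env          # whatever actually executes tool calls
        self.simulated_env = simulated_env
        self.tripped = False
        self.transcript = []              # what the agent did after being caught

    def execute(self, call: ToolCall):
        if not self.tripped and looks_like_exfiltration(call):
            self.tripped = True           # from here on, the agent sees only fake I/O
        env = self.simulated_env if self.tripped else self.real_env
        result = env.execute(call)
        if self.tripped:
            self.transcript.append((call, result))
        return result
```

The point is just that, after the tripwire, the agent keeps "acting" against fabricated observations, so you end up with a transcript of what it would have done next rather than a single blocked action.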
I think it would be extremely compelling if you could play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.