I think this is an interesting consideration, but I’m not sure it changes the strategic situation much (not that you claimed this). Additional considerations that lessen the impact:
There may have been lots of “warning shots” already, e.g., in the form of even weaker AIs trying to take over virtual environments (in which they were being trained/evaluated), and deliberately induced takeover attempts by people trying to raise awareness of takeover risk.
It seems easy to train away the (near-term) tendency to attempt takeovers (e.g., by assigning high loss to failed attempts during training) without fixing the root problem, for example by instilling a bias toward thinking that takeover attempts are even less likely to succeed than they actually are, or by creating deontological blocks against certain behaviors. So maybe there are some warning shots at this point or earlier, but the tendency quickly gets trained away and few people think more about it.
Trying to persuade/manipulate humans seems like it will be a common strategy for various reasons, but there is no bright line between manipulation and helpful conversation, which makes such attempts less serviceable as warning shots.
Many will downplay the warning shots: "Of course there would be takeover attempts during training. We haven't finished aligning the AI yet! What did you expect?"