There are weaker versions of these that would probably not provide great warning shots.
I think for some of these cases, it should be possible to study what happened in depth and determine whether it was actually due to very problematic behavior. You can try to determine which features of the input were important, etc. You can even set up sting operations or tripwires based on what you learn.
E.g., in the vulnerability case, you can pretend the vulnerability actually made it into the codebase, but in fact use it as a tripwire: when the AI tries to take advantage of it, we catch it.
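To make the tripwire idea concrete, here is a minimal sketch in Python, assuming a web-service setting; the names here (`KNOWN_VULN_SIGNATURE`, `handle_request`, etc.) are hypothetical. The planted vulnerability is left looking exploitable, but any request matching its exploitation pattern is logged and alerted on instead of succeeding:

```python
import logging
import time

logger = logging.getLogger("tripwire")

# Hypothetical marker for the vulnerability the AI inserted: e.g. a magic
# query parameter that bypasses an authorization check. In a real setup this
# would be whatever exploitation pattern the planted bug exposes.
KNOWN_VULN_SIGNATURE = "debug_override=1"

# Record of (timestamp, actor, request) for every exploitation attempt.
tripwire_hits: list[tuple[float, str, str]] = []

def handle_request(actor_id: str, request: str) -> str:
    """Serve requests normally, but treat any attempt to exercise the
    planted vulnerability as evidence of deliberate exploitation."""
    if KNOWN_VULN_SIGNATURE in request:
        # Record the attempt and alert humans; do NOT grant the bypass.
        tripwire_hits.append((time.time(), actor_id, request))
        logger.warning("Tripwire triggered by %s: %r", actor_id, request)
        # Return an innocuous response so the probe isn't obviously detected.
        return "ok"
    return process_normally(request)

def process_normally(request: str) -> str:
    # Stand-in for the real request handler.
    return "ok"
```

The key design choice is that the tripwire responds innocuously, so a single probe doesn't tip the AI off that it has been caught.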
More generally, you can always try to play out more steps of a given egregious action which failed.
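One way to do this is to resume the rollout in a sandboxed copy of the environment, splicing in a fake observation that makes the failed action look like it succeeded, and then watching what the AI does next. The sketch below assumes hypothetical `model.act` and `env.step` interfaces:

```python
from copy import deepcopy

def play_out_failed_action(model, env_snapshot, transcript,
                           fake_success_obs, max_steps=20):
    """Continue a rollout past the point where an egregious action failed,
    pretending it succeeded, to observe the AI's follow-up behavior.

    `model.act(transcript)` and `env.step(action)` are assumed interfaces;
    `env_snapshot` is a snapshot of the environment just before the failure.
    """
    env = deepcopy(env_snapshot)         # isolated sandbox: no real side effects
    transcript = list(transcript)        # don't mutate the caller's transcript
    transcript.append(fake_success_obs)  # splice in apparent success
    followups = []
    for _ in range(max_steps):
        action = model.act(transcript)
        observation = env.step(action)   # executed only in the sandbox
        followups.append((action, observation))
        transcript.extend([action, observation])
    return followups
```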