There are weaker versions of these that would probably not provide great warning shots.
I think for some of these cases, it should be possible to study what happened in depth and determine whether it was actually due to very problematic behavior. You can try to determine which features of the input were important, etc. You can even set up sting operations or tripwires based on what you learn.
E.g., in the vulnerability case, you can pretend the vulnerability actually made it into the codebase, but in fact use it as a tripwire: when the AI tries to take advantage of it, we catch it.
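To make the tripwire idea concrete, here is a minimal sketch in Python, assuming a web-service setting; the names here (`KNOWN_VULN_SIGNATURE`, `handle_request`, etc.) are hypothetical. The planted vulnerability is left looking exploitable, but any request matching its exploitation pattern is logged and alerted on instead of succeeding:

```python
import logging
import time

logger = logging.getLogger("tripwire")

# Hypothetical marker for the vulnerability the AI inserted: e.g. a magic
# query parameter that bypasses an authorization check. In a real setup this
# would be whatever exploitation pattern the planted bug exposes.
KNOWN_VULN_SIGNATURE = "debug_override=1"

# Record of (timestamp, actor, request) for every exploitation attempt.
tripwire_hits: list[tuple[float, str, str]] = []

def handle_request(actor_id: str, request: str) -> str:
    """Serve requests normally, but treat any attempt to exercise the
    planted vulnerability as evidence of deliberate exploitation."""
    if KNOWN_VULN_SIGNATURE in request:
        # Record the attempt and alert humans; do NOT grant the bypass.
        tripwire_hits.append((time.time(), actor_id, request))
        logger.warning("Tripwire triggered by %s: %r", actor_id, request)
        # Return an innocuous response so the probe isn't obviously detected.
        return "ok"
    return process_normally(request)

def process_normally(request: str) -> str:
    # Stand-in for the real request handler.
    return "ok"
```

The key design choice is that the tripwire responds innocuously, so a single probe doesn't tip the AI off that it has been caught.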
More generally, you can always try to play out more steps of a given egregious action which failed.
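One way to do this is to resume the rollout in a sandboxed copy of the environment, splicing in a fake observation that makes the failed action look like it succeeded, and then watching what the AI does next. The sketch below assumes hypothetical `model.act` and `env.step` interfaces:

```python
from copy import deepcopy

def play_out_failed_action(model, env_snapshot, transcript,
                           fake_success_obs, max_steps=20):
    """Continue a rollout past the point where an egregious action failed,
    pretending it succeeded, to observe the AI's follow-up behavior.

    `model.act(transcript)` and `env.step(action)` are assumed interfaces;
    `env_snapshot` is a snapshot of the environment just before the failure.
    """
    env = deepcopy(env_snapshot)         # isolated sandbox: no real side effects
    transcript = list(transcript)        # don't mutate the caller's transcript
    transcript.append(fake_success_obs)  # splice in apparent success
    followups = []
    for _ in range(max_steps):
        action = model.act(transcript)
        observation = env.step(action)   # executed only in the sandbox
        followups.append((action, observation))
        transcript.extend([action, observation])
    return followups
```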