I think this idea, though quite simple and obvious, is very important. I think coup probes are the paradigmatic example of a safety technique that uses model internals access, and they’re an extremely helpful concrete baseline to think about in many cases, e.g. when considering safety cases via mech interp. I refer to this post constantly. We followed up on it in Catching AIs red-handed. (We usually call them “off-policy probes” now.)
Unfortunately, this paper hasn’t been followed up with as much empirical research as I’d hoped; Anthropic’s Simple probes can catch sleeper agents explores a different technique that I think is less promising or important than the one in this paper. There are some empirical projects following up on it now, though. EDIT: Also this is a good empirical follow-up.
[COI notice: this is a Redwood Research output]