This paper argues that unintended deceptive behavior cannot be detected by probing: according to its authors, probing methods fare no better than random guessing at detecting such behavior.
I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
Our work here is not arguing that probing is a perfect solution in general; it’s just a single datapoint of how it fares on the models from our Sleeper Agents paper.