Dakara comments on Simple probes can catch sleeper agents

Dakara 18 Nov 2024 10:07 UTC
LW: 3 AF: 2
This paper argues that unintended deceptive behavior is not susceptible to detection by probing methods; its authors report that probing fares no better than random guessing at detecting such behavior.

I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
- evhub 18 Nov 2024 20:22 UTC
  LW: 2 AF: 2
  Our work here is not arguing that probing is a perfect solution in general; it’s just a single data point on how probing fares on the models from our Sleeper Agents paper.