On the topic of related work, Mallen et al. performed a similar experiment in Eliciting Latent Knowledge from Quirky Language Models, and found similar results.
(As in this work, they used linear probing to distinguish what the model knows from what the model says: models were trained to be deceptive conditional on a trigger word, and the probes were never trained on any examples of deceptive behavior. They found that probing “works,” that middle layers are most informative, and that deceptive activations “look different” in a way that can be detected mechanically without any supervision about deception [reminiscent of the PCA observations here].)
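For anyone unfamiliar with this kind of setup, here is a minimal sketch of the two measurements described above, assuming activations have already been extracted from a middle layer. Everything here is hypothetical: the array names, shapes, and synthetic data stand in for real model activations and labels, and none of it comes from either paper.

```python
# Minimal sketch: a supervised linear probe on frozen middle-layer activations,
# plus an unsupervised PCA check. All data below is synthetic/hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: activations from one middle layer (n_examples x d_model)
# and binary "what the model knows" labels, gathered only on non-deceptive
# examples, per the training setup described above.
acts = rng.normal(size=(1000, 512))
knows_labels = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    acts, knows_labels, test_size=0.2, random_state=0
)

# Supervised probe: logistic regression on the frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# Unsupervised check reminiscent of the PCA observation: project activations
# onto their top principal components and look for separable clusters,
# using no deception labels at any point.
pca = PCA(n_components=2)
projected = pca.fit_transform(acts)
print("variance explained by top 2 PCs:", pca.explained_variance_ratio_.sum())
```

In practice one would fit a probe per layer and compare held-out accuracies to see which layers are most informative; this sketch only shows a single layer.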
Thanks for letting us know! I wasn’t personally aware of this paper before; it’s definitely very interesting and relevant. We’ve updated the “related work” section of the blog post to refer to it. Apologies for the oversight.