On the topic of related work, Mallen et al. performed a similar experiment in Eliciting Latent Knowledge from Quirky Language Models, and found similar results.
(As in this work, they used linear probing to distinguish what the model knows from what the model says: models were trained to be deceptive conditional on a trigger word, and the probes were never trained on any examples of deceptive behavior. They found that probing “works,” that middle layers are most informative, and that deceptive activations “look different” in a way that can be detected mechanically without any supervision about deception [reminiscent of the PCA observations here].)
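For anyone unfamiliar with this kind of setup, here is a minimal sketch of the two measurements described above, assuming activations have already been extracted from a middle layer. Everything here is hypothetical: the array names, shapes, and synthetic data stand in for real model activations and labels, and none of it comes from either paper.

```python
# Minimal sketch: a supervised linear probe on frozen middle-layer activations,
# plus an unsupervised PCA check. All data below is synthetic/hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: activations from one middle layer (n_examples x d_model)
# and binary "what the model knows" labels, gathered only on non-deceptive
# examples, per the training setup described above.
acts = rng.normal(size=(1000, 512))
knows_labels = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    acts, knows_labels, test_size=0.2, random_state=0
)

# Supervised probe: logistic regression on the frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# Unsupervised check reminiscent of the PCA observation: project activations
# onto their top principal components and look for separable clusters,
# using no deception labels at any point.
pca = PCA(n_components=2)
projected = pca.fit_transform(acts)
print("variance explained by top 2 PCs:", pca.explained_variance_ratio_.sum())
```

In practice one would fit a probe per layer and compare held-out accuracies to see which layers are most informative; this sketch only shows a single layer.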
Thanks for letting us know! I wasn’t personally aware of this paper before; it’s definitely very interesting and relevant. We’ve updated the “related work” section of the blog post to refer to it. Apologies for the oversight.