Lennart Buerger comments on Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Lennart Buerger 7 Nov 2024 13:14 UTC
1 point
Really interesting work, especially the three stages you find. Regarding your question of whether your classifier will generalize to negated statements: this has been explored in our NeurIPS 2024 paper “Truth is Universal: Robust Detection of Lies in LLMs” (https://arxiv.org/abs/2407.12831). In fact, true and false negated statements separate along a different direction than statements without negation, so a classifier trained only on statements without negation would not generalize to them. The truthfulness representation found in this paper is also universal and can be found in multiple LLMs, which is consistent with your findings :)
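
To make the generalization point concrete, here is a toy sketch using synthetic vectors, not real LLM activations and not the paper's code. It assumes, purely for illustration, two hypothetical orthogonal directions: `t_general`, which separates true from false for both polarities, and `t_polar`, whose sign flips under negation. The names, dimensions, noise level, and the simple mean-difference probe are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical orthogonal unit directions (assumptions for illustration only).
t_general = np.zeros(dim)
t_general[0] = 1.0  # separates true/false for both polarities
t_polar = np.zeros(dim)
t_polar[1] = 1.0    # separates true/false, but with opposite sign under negation


def make_activations(n, polarity):
    """Synthetic 'activations': truth*t_general + truth*polarity*t_polar + noise."""
    truth = rng.choice([-1.0, 1.0], size=n)  # +1 = true statement, -1 = false
    acts = (
        truth[:, None] * t_general
        + (truth * polarity)[:, None] * t_polar
        + 0.3 * rng.standard_normal((n, dim))
    )
    return acts, truth


def fit_mean_difference_probe(acts, truth):
    """Probe direction = mean(true) - mean(false); boundary at the midpoint."""
    mu_true = acts[truth > 0].mean(axis=0)
    mu_false = acts[truth < 0].mean(axis=0)
    w = mu_true - mu_false
    b = -w @ (mu_true + mu_false) / 2
    return w, b


def accuracy(w, b, acts, truth):
    """Fraction of samples whose predicted sign matches the truth label."""
    return float(np.mean(np.sign(acts @ w + b) == truth))


# polarity=+1: statements without negation; polarity=-1: negated statements.
affirm_x, affirm_y = make_activations(2000, polarity=+1.0)
negated_x, negated_y = make_activations(2000, polarity=-1.0)

# Train only on non-negated statements, then evaluate on both kinds.
w, b = fit_mean_difference_probe(affirm_x[:1000], affirm_y[:1000])
acc_affirm = accuracy(w, b, affirm_x[1000:], affirm_y[1000:])
acc_negated = accuracy(w, b, negated_x, negated_y)

print(f"held-out non-negated accuracy: {acc_affirm:.3f}")
print(f"negated accuracy:              {acc_negated:.3f}")
```

In this toy setup the probe learns roughly `t_general + t_polar`, while negated statements separate along `t_general - t_polar`. Because the two are orthogonal here, the probe does well on held-out non-negated statements and falls to about chance on negated ones. How large the drop is in a real model depends on the actual geometry of its activations.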