Lennart Buerger

Karma: 25

I studied Physics in Heidelberg and Oxford and am now doing research on AI Alignment and LLM Interpretability, currently as part of my Master’s thesis in Fred Hamprecht’s SciAI Lab (Heidelberg, Germany). If you want to discuss something, have questions or would like to collaborate, feel free to drop me a message!

Lennart Buerger Nov 7, 2024, 1:14 PM
1 point
0
on: Iterative Refinement Stages of Lying in LLMs
Really interesting work, especially the three stages you find. Regarding your question of whether your classifier will generalize to negated statements: this has been explored in our NeurIPS 2024 paper “Truth is Universal: Robust Detection of Lies in LLMs” (https://arxiv.org/abs/2407.12831). In fact, true and false negated statements separate along a different direction than true and false statements without negation, so a classifier trained only on non-negated statements would not generalize. The truthfulness representation found in this paper is also universal and can be found in multiple LLMs, which is consistent with your findings :)
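
To illustrate the negation point, here is a minimal synthetic sketch. It is not the paper's code and uses no real LLM activations. It assumes activations of the form `tau * (t_G + polarity * t_P) + noise`, where `tau` is +1 for true and -1 for false statements, `polarity` is +1 for affirmative and -1 for negated statements, and `t_P` is a hypothetical polarity-sensitive direction introduced only for this example. The dimension, the noise level, and the choice that `t_P` has a larger norm than `t_G` are also assumptions; with equal norms the transfer to negated statements would sit near chance instead of below it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy activation dimension (assumption)

# Hypothetical directions, chosen by hand for illustration only.
t_G = np.zeros(d)
t_G[0] = 1.0  # general truth direction
t_P = np.zeros(d)
t_P[1] = 1.5  # polarity-sensitive truth direction (assumed larger norm)


def make_activations(n, polarity):
    """Synthetic activations for n statements (half true, half false)."""
    tau = np.repeat([1.0, -1.0], n // 2)  # +1 = true, -1 = false
    noise = 0.1 * rng.standard_normal((tau.size, d))
    acts = tau[:, None] * (t_G + polarity * t_P) + noise
    return acts, tau


def mean_diff_direction(acts, tau):
    """Difference-of-means probe: mean(true) - mean(false)."""
    return acts[tau > 0].mean(axis=0) - acts[tau < 0].mean(axis=0)


def accuracy(direction, acts, tau):
    """Fraction of statements whose projection sign matches the label."""
    return float(np.mean(np.sign(acts @ direction) == tau))


aff_train, aff_tau = make_activations(200, polarity=+1)
neg_train, neg_tau = make_activations(200, polarity=-1)
neg_test, neg_test_tau = make_activations(200, polarity=-1)

# Probe fitted on affirmative statements only: recovers roughly t_G + t_P.
dir_affirmative = mean_diff_direction(aff_train, aff_tau)

# Probe fitted on both polarities: t_P cancels, leaving roughly t_G.
both_acts = np.concatenate([aff_train, neg_train])
both_tau = np.concatenate([aff_tau, neg_tau])
dir_both = mean_diff_direction(both_acts, both_tau)

acc_affirmative_only = accuracy(dir_affirmative, neg_test, neg_test_tau)
acc_both = accuracy(dir_both, neg_test, neg_test_tau)

print("trained on affirmative only, tested on negated:", acc_affirmative_only)
print("trained on both polarities, tested on negated:", acc_both)
```

Under these assumptions, the affirmative-only probe picks up the polarity-sensitive component and transfers poorly to negated statements, while the probe fitted on both polarities isolates the shared component.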

Lennart Buerger Aug 6, 2024, 8:29 AM
LW: 3 AF: 1
0
AF
in reply to: Kieron Kretschmar’s comment on: Truth is Universal: Robust Detection of Lies in LLMs
This is an excellent question! Indeed, we cannot rule out that $t_{G}$ is a linear combination or Boolean function of features, since we are not able to investigate every possible distribution shift. However, we showed in the paper that $t_{G}$ generalizes robustly under several significant distribution shifts. Specifically, $t_{G}$ is learned from a limited training set consisting of simple affirmative and negated statements on a restricted number of topics, all ending with a "." token. Despite this limited training data, $t_{G}$ generalizes reasonably well to (i) unseen topics, (ii) unseen statement types, (iii) real-world scenarios, and (iv) other final tokens like "!" or ".'". I think that the real-world scenarios (iii) are a particularly significant distribution shift. However, I agree with you that tests on many more distribution shifts are needed to be highly confident that $t_{G}$ is indeed an elementary feature (if something like that even exists).

Lennart Buerger Jul 25, 2024, 2:21 PM
LW: 1 AF: 1
0
AF
on: JumpReLU SAEs + Early Access to Gemma 2 SAEs
Nice work! I was wondering what context length you used when you extracted the LLM activations to train the SAE. I could not find it in the paper, but I might have missed it. I know that OpenAI used a context length of 64 tokens in all their experiments, which is probably not sufficient to elicit many interesting features. Do you use a variable context length or a fixed value as well?

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger Jul 19, 2024, 2:07 PM

24 points

3 comments, 2 min read, LW link

(arxiv.org)
