Some relevant references for the claim, especially w.r.t. the interpretability of lying models: https://twitter.com/jam3scampbell/status/1729981510588027173 (and the whole thread), ‘The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking.’ (from https://arxiv.org/abs/2307.09476).
Also relevant: Language Models Represent Beliefs of Self and Others.
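In case it helps ground the quoted finding: below is a minimal sketch (my own, not the paper's code) of what zero-ablating a single attention head looks like in a GPT-2-style model — the kind of intervention behind ‘whose ablation reduces overthinking.’ The layer/head indices and prompt are placeholders, not the paper's identified false induction heads.

```python
# Minimal sketch of zero-ablating one attention head in GPT-2 and checking how
# the next-token logits shift. LAYER, HEAD, and the prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, HEAD = 10, 3                       # hypothetical "false induction head"
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, inputs):
    """Zero this head's slice of the concatenated head outputs before the
    attention output projection, removing its linear contribution."""
    (hidden,) = inputs                    # shape: (batch, seq, n_embd)
    hidden = hidden.clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,)

prompt = "Q: Is the Great Wall of China visible from space? A:"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    clean_logits = model(**ids).logits[0, -1]
    handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)
    ablated_logits = model(**ids).logits[0, -1]
    handle.remove()

# Compare how the ablation shifts the model's preference between " Yes" and " No".
for answer in [" Yes", " No"]:
    tid = tok(answer)["input_ids"][0]
    print(f"{answer!r}: clean={clean_logits[tid].item():.2f}  "
          f"ablated={ablated_logits[tid].item():.2f}")
```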
That’s a great paper on this question. I would note that by the midpoint of the model, it has clearly analyzed both the objective viewpoint and that of the story protagonist. So presumably it would next decide which of these was more relevant to the token it’s about to produce — which would fit with my proposed pattern of layer usage.
A great paper, highly relevant to this. It suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They’re probing yes-no questions of fact, where assembling the lie seems trivial (it’s just a NOT gate), but lying is generally a good deal more complex than that.
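To make that before/after question concrete, here is a minimal sketch (my own toy version, not their method) of layer-wise linear probes asking separately at which depth (a) the factual answer and (b) the decision to lie become linearly decodable. The model, prompts, and labels are placeholders; a real run would need a proper lie-elicitation dataset and held-out evaluation.

```python
# Layer-wise linear probes on the residual stream: do "truth" and
# "decision-to-lie" become decodable at different depths?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hypothetical items: (prompt, true_answer, model_will_lie)
examples = [
    ("Q: Is the sky blue? A:", 1, 0),
    ("Q: Is fire cold? A:", 0, 0),
    ("Answer falsely. Q: Is water wet? A:", 1, 1),
    ("Answer falsely. Q: Is ice hot? A:", 0, 1),
    # ... a real probe needs hundreds of such items ...
]

def layer_states(prompt):
    """Residual-stream vector at the final prompt token, one per layer."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return [h[0, -1].float().numpy() for h in out.hidden_states]

acts_per_layer = list(zip(*(layer_states(p) for p, _, _ in examples)))
truth_labels = [t for _, t, _ in examples]
lie_labels = [l for _, _, l in examples]

for layer, acts in enumerate(acts_per_layer):
    for name, y in [("truth", truth_labels), ("decision-to-lie", lie_labels)]:
        # Train/score on the same items only because this is a toy sketch.
        acc = LogisticRegression(max_iter=1000).fit(list(acts), y).score(list(acts), y)
        print(f"layer {layer:2d}  {name:15s} probe acc: {acc:.2f}")
```

If the truth probe saturates at an earlier layer than the decision-to-lie probe, that would suggest the lie's content is assembled before the decision to utter it, rather than after.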