Some relevant references for the claim, especially w.r.t. the interpretability of lying models: https://twitter.com/jam3scampbell/status/1729981510588027173 (and the whole thread), ‘The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking.’ (from https://arxiv.org/abs/2307.09476).
Also relevant: Language Models Represent Beliefs of Self and Others.
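In case it helps ground the quoted finding: below is a minimal sketch (my own, not the paper's code) of what zero-ablating a single attention head looks like in a GPT-2-style model — the kind of intervention behind ‘whose ablation reduces overthinking.’ The layer/head indices and prompt are placeholders, not the paper's identified false induction heads.

```python
# Minimal sketch of zero-ablating one attention head in GPT-2 and checking how
# the next-token logits shift. LAYER, HEAD, and the prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, HEAD = 10, 3                       # hypothetical "false induction head"
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, inputs):
    """Zero this head's slice of the concatenated head outputs before the
    attention output projection, removing its linear contribution."""
    (hidden,) = inputs                    # shape: (batch, seq, n_embd)
    hidden = hidden.clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,)

prompt = "Q: Is the Great Wall of China visible from space? A:"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    clean_logits = model(**ids).logits[0, -1]
    handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)
    ablated_logits = model(**ids).logits[0, -1]
    handle.remove()

# Compare how the ablation shifts the model's preference between " Yes" and " No".
for answer in [" Yes", " No"]:
    tid = tok(answer)["input_ids"][0]
    print(f"{answer!r}: clean={clean_logits[tid].item():.2f}  "
          f"ablated={ablated_logits[tid].item():.2f}")
```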
That’s a great paper on this question. I would note that by the midpoint of the model, it has clearly analyzed both the objective viewpoint and that of the story protagonist. So presumably it would next decide which of these was more relevant to the token it’s about to produce — which would fit with my proposed pattern of layer usage.
A great paper, highly relevant to this. It suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They’re probing yes-no questions of fact, where assembling the lie seems trivial (it’s just a NOT gate), but lying is generally a good deal more complex than that.
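To make that before/after question concrete, here is a minimal sketch (my own toy version, not their method) of layer-wise linear probes asking separately at which depth (a) the factual answer and (b) the decision to lie become linearly decodable. The model, prompts, and labels are placeholders; a real run would need a proper lie-elicitation dataset and held-out evaluation.

```python
# Layer-wise linear probes on the residual stream: do "truth" and
# "decision-to-lie" become decodable at different depths?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hypothetical items: (prompt, true_answer, model_will_lie)
examples = [
    ("Q: Is the sky blue? A:", 1, 0),
    ("Q: Is fire cold? A:", 0, 0),
    ("Answer falsely. Q: Is water wet? A:", 1, 1),
    ("Answer falsely. Q: Is ice hot? A:", 0, 1),
    # ... a real probe needs hundreds of such items ...
]

def layer_states(prompt):
    """Residual-stream vector at the final prompt token, one per layer."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return [h[0, -1].float().numpy() for h in out.hidden_states]

acts_per_layer = list(zip(*(layer_states(p) for p, _, _ in examples)))
truth_labels = [t for _, t, _ in examples]
lie_labels = [l for _, _, l in examples]

for layer, acts in enumerate(acts_per_layer):
    for name, y in [("truth", truth_labels), ("decision-to-lie", lie_labels)]:
        # Train/score on the same items only because this is a toy sketch.
        acc = LogisticRegression(max_iter=1000).fit(list(acts), y).score(list(acts), y)
        print(f"layer {layer:2d}  {name:15s} probe acc: {acc:.2f}")
```

If the truth probe saturates at an earlier layer than the decision-to-lie probe, that would suggest the lie's content is assembled before the decision to utter it, rather than after.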