Linda Linsefors comments on Interpreting the Learning of Deceit

Linda Linsefors 29 Dec 2023 20:15 UTC
LW: 4 AF: 3
Current Interpretability results suggest that roughly the first half of the layers in an LLM correspond to understanding the context at increasingly abstract levels, and the second half to figuring out what to say and turning that back from abstractions into concrete tokens. It’s further been observed that in the second half, figuring out what to say generally seems to occur in stages: first working out the baseline relevant facts, then figuring out how to appropriately slant/color those in the current context, then converting these into the correct language, and last getting the nitty-gritty details of tokenization right.
How do we know this? This claim seems plausible, but also I did not know that mech-interp was advanced enough to verify something like this. Where can I read more?
- Bogdan Ionut Cirstea 29 Dec 2023 22:33 UTC
  1 point
  Some relevant references for the claim, especially w.r.t. the interpretability of lying models: https://twitter.com/jam3scampbell/status/1729981510588027173 (and the whole thread), ‘The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking.’ (from https://arxiv.org/abs/2307.09476).
  - Bogdan Ionut Cirstea 2 Mar 2024 0:39 UTC
    3 points
    Also relevant: Language Models Represent Beliefs of Self and Others.
    - RogerDearnaley 20 Oct 2024 20:21 UTC
      4 points
      That’s a great paper on this question. I would note that by the midpoint of the model, it has clearly analyzed both the objective viewpoint and also that of the story protagonist. So presumably it would next decide which of these was more relevant to the token it’s about to produce — which would fit with my proposed pattern of layer usage.
  - RogerDearnaley 20 Oct 2024 20:37 UTC
    2 points
    A great paper, highly relevant to this. It suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They’re probing yes-no questions of fact, where assembling the lie seems trivial (it’s just a NOT gate), but lying is generally a good deal more complex than that.
- RogerDearnaley 29 Dec 2023 22:24 UTC
  LW: 1 AF: 1
  An excellent question. I know those were hypotheses in one or more mechanistic interpretability papers I read in the last year or so, or that I pieced together from a combination of several of them, but I’m afraid I don’t recall where, nor was I able to find the source when I was writing this, which is why I didn’t add a link. I think the “first half encoding, second half decoding” part is fairly widespread, and I’ve seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):
  In summary, the general pattern of observations across layers suggests a rough layout where early layers “de-tokenize,” mapping tokens to fairly concrete concepts (phrases like “machine learning” or words when used in a specific language), the middle of the network deals in more abstract concepts such as “any clause that describes music,” and the later portions of the network “re-tokenize,” converting concrete concepts back into literal tokens to be output. All of this is very preliminary and requires much more detailed study to draw solid conclusions. However, our experience in vision was that having a sense of what kinds of features tend to exist at different layers was very helpful as high-level orientation for understanding models (see especially ). It seems promising that we may be developing something similar here.
  which is not quite the same thing, though there is some resemblance. There’s also a relation to the encoding and decoding concepts of sections 2 and 3 of the recent, more theoretical paper White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?, though that doesn’t make it clear that equal numbers of layers are required. (That also explains why the behavior of so-called “decoder-only” and “encoder-decoder” transformer models is so similar.)
  The “baseline before applying bias” part was, I think, from one of the papers on lie detection, latent knowledge extraction, and/or bias, of which there has been a whole series this year, some from Paul Christiano’s team and some from others.
  On where to read more, I’d suggest starting with the Anthropic research blog where they discuss their research papers for the last year or so: roughly 40% of those are on mechanistic interpretability, and there’s always a blog post summary for a science-interested-layman reader with a link to the actual paper. There’s also some excellent work coming from other places, such as Neel Nanda, who similarly has a blog website, and the ELK work under Paul Christiano. Overall we’ve made quite a bit of progress on interpretability in the last 18 months or so, though there’s still a long way to go.