I think this pattern is common because of the repetition. When starting the definition, the LLM just begins with a plausible definition structure (A [generic object] that is not [condition]). Lots of definitions look like this. Next it fills in some common [generic object]. Then it wants to figure out what the specific [condition] is that the object in question does not meet. So it attends back to the word to be defined, but it finds nothing: there is no information stored about this non-token. So the attention head that should come up with a plausible candidate for [condition] writes nothing to the residual stream. What dominates the prediction now are the more base-level predictive patterns that are normally overwritten, like word repetition (something transformers learn very quickly and often struggle not to overdo). The repeated word that at least fits grammatically is [generic object], so that gets predicted as the next token.
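One way the "writes nothing to the residual stream" claim could be probed is to compare how much the attention layers contribute at the position that has to predict [condition], with a non-token versus a real word in the definition slot. Here is a minimal sketch, assuming a TransformerLens-style `HookedTransformer`; the model (`gpt2-small`), the empty-string stand-in for the non-token, and the comparison word are illustrative assumptions, not the actual setup discussed here:

```python
# Hypothetical probe of the claim above: measure the norm of each attention
# layer's output at the final position (the one that must predict [condition])
# for an "empty" defined word vs. a real one. Under the theory, the non-token
# case should show a noticeably smaller attention contribution there.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")  # stand-in model

def attn_out_norms(prompt):
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    # Norm of the attention block's contribution to the residual stream
    # at the final sequence position, for every layer.
    return [cache["attn_out", layer][0, -1].norm().item()
            for layer in range(model.cfg.n_layers)]

template = "A typical definition of '{}' would be: a person who is not a"
print("empty-string stand-in:", attn_out_norms(template.format("")))
print("real word:            ", attn_out_norms(template.format("stranger")))
```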
Here are some predictions I would make based on that theory (a rough code sketch of these checks follows the list):
- When you suppress attention to [generic object] at the sequence position where it predicts [condition], you will get a reasonable condition.
- When you look (with logit lens) at which layer the transformer decides to predict [generic object] as the last token, it will be a relatively early layer.
- Now replace the word the transformer should define with a real, normal word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer.
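A minimal sketch of how the first two checks could be run, assuming a TransformerLens-style `HookedTransformer`; the model (`gpt2-small`), the placeholder prompt, and the hook names are illustrative assumptions rather than the actual setup under discussion:

```python
# Hypothetical sketch of the logit-lens and attention-suppression checks,
# using the TransformerLens library; prompt and model are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")  # stand-in model

# Placeholder prompt; in the real experiment 'X' would be the synthetic
# non-token whose definition is being asked for.
prompt = "A typical definition of 'X' would be: a person who is not a"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

person_tok = model.to_single_token(" person")

# Prediction 2 (logit lens): decode the residual stream after every layer at
# the final position and see at which layer " person" (the repeated
# [generic object]) becomes the top prediction.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]            # [batch, 1, d_model]
    layer_logits = model.unembed(model.ln_final(resid))[0, -1]
    top = layer_logits.argmax().item()
    print(f"layer {layer:2d}: top = {model.to_string(top)!r}, "
          f"is ' person': {top == person_tok}")

# Prediction 1 (attention suppression): zero the attention paid by the final
# query position (which must predict [condition]) to the " person" key
# position, in every head of every layer, and see what gets predicted instead.
person_pos = (tokens[0] == person_tok).nonzero()[0].item()

def suppress_person(pattern, hook):
    # pattern: [batch, head, query_pos, key_pos]
    pattern[:, :, -1, person_pos] = 0.0
    return pattern

suppressed = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{layer}.attn.hook_pattern", suppress_person)
               for layer in range(model.cfg.n_layers)],
)
print("top completion with suppression:",
      repr(model.to_string(suppressed[0, -1].argmax().item())))
```

For the third prediction, the same logit-lens loop would simply be rerun with a real word substituted for 'X', and the layer at which " person" first becomes the top prediction compared between the two runs.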
‘When you suppress attention to [generic object] at the sequence position where it predicts [condition], you will get a reasonable condition.’
Can you unpack what you mean by ‘a reasonable condition’ here?
Something like ‘A person who is not a librarian’ would be reasonable. Some people are librarians, and some are not.
What I do not expect to see are cases like ‘A person who is not a person’ (contradictory definitions) or ‘A person who is not a and’ (grammatically incorrect completions).
If my prediction is wrong and it still completes with ‘A person who is not a person’, that would mean it decides on that definition just by looking at the synthetic token. It would “really believe” that this token has that definition.
Got it, that makes sense. Thanks!
So, trying to imagine a concrete example of this, I picture a prompt like: “A typical definition of ‘goy’ would be: a person who is not a”, and you would expect the natural completion to be ” Jew” regardless of whether attention to ” person” is suppressed (unlike in the empty-string case). Does that correctly capture what you’re thinking of? (‘goy’ is a bit awkward here since it’s an unusual & slangy word, but I couldn’t immediately think of a better example.)
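A quick sketch of that proposed control, under the same assumed TransformerLens-style setup as above (the model choice and hook names are assumptions): run the ‘goy’ prompt with and without suppressing attention to " person" and compare the top completion.

```python
# Hypothetical version of the real-word control: the top completion after
# "...a person who is not a" should stay roughly " Jew"-like even when
# attention to " person" is zeroed out, unlike in the empty-string case.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")  # stand-in model
prompt = "A typical definition of 'goy' would be: a person who is not a"
tokens = model.to_tokens(prompt)

person_pos = (tokens[0] == model.to_single_token(" person")).nonzero()[0].item()

def suppress_person(pattern, hook):
    # pattern: [batch, head, query_pos, key_pos]; cut the final query's
    # attention to the " person" token in every head.
    pattern[:, :, -1, person_pos] = 0.0
    return pattern

plain = model(tokens)  # ordinary forward pass, returns logits
ablated = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{layer}.attn.hook_pattern", suppress_person)
               for layer in range(model.cfg.n_layers)],
)

for label, logits in [("plain", plain), ("' person' suppressed", ablated)]:
    print(label, "->", repr(model.to_string(logits[0, -1].argmax().item())))
```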