Thanks, it’s fixed!
Alexandre Variengien
Gliders in Language Models
Thanks for your comment!
1. Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I’m confused. You say “duplicating the IO token in a distractor sentence”, but I thought David would be the IO here?
Am I confused about the meaning of the IO or was there just a typo in the example?
You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizabeth”.
When using the corrected adversarial prompt, the probability of S (“Elizabeth”) increases while the probability of IO (“David”) decreases.
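If you want to check this yourself, here is a minimal sketch of such a measurement. It assumes GPT-2 small (the model studied in the post) and the Hugging Face transformers API, so treat the exact numbers as illustrative:

```python
# Minimal sketch: compare p(S) and p(IO) as next tokens on the corrected
# adversarial prompt. Assumes GPT-2 small via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = ("Then, David and Elizabeth were working at the school. "
          "David had a good day. Elizabeth decided to give a bone to")

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits
probs = logits[0, -1].softmax(dim=-1)

# Both names are single tokens in GPT-2's vocabulary when preceded by a space.
for label, name in [("IO", " David"), ("S", " Elizabeth")]:
    token_id = tokenizer.encode(name)[0]
    print(f"p({label} = {name.strip()}) = {probs[token_id].item():.4f}")
```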
Thanks a lot for spotting the typo, we corrected the post!
2. I’d love if you could expand on this (maybe with an example). It sounds like you’re implying that the circuit you found is not complete?
One way we think the circuit can differ across examples is when different semantic meanings are involved. For instance, in the example above, the object given is a “bone”, such that “a dog” could also be a plausible prediction. If “Elizabeth decided to give a kiss”, then the name of a human seems more plausible. If this is the case, then there should be additional components interacting with the circuit we described to incorporate information about the meaning of the object.
In addition to semantic meaning, there could be a different circuit for each template: different circuits could be used to handle different sentence structures.
In our study we did not investigate what differs between specific examples, as we always average experimental results over the full distribution. In this sense the circuit we found is not complete, as we cannot explain the full distribution of the model’s outputs. However, we would expect each such circuit to be a variation of the circuit we described in the paper. There are other ways in which we think our circuit is not complete; see Section 4.1 for more experiments on these issues.
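As a concrete illustration of the semantic-meaning point above, one could compare the model’s preferred continuations when only the object changes. The sketch below is a hypothetical probe of ours (again assuming GPT-2 small and the Hugging Face API), not an experiment from the paper:

```python
# Hypothetical probe: does swapping the object ("bone" vs. "kiss") change
# which continuations the model finds plausible after "to"?
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

template = ("Then, David and Elizabeth were working at the school. "
            "Elizabeth decided to give a {obj} to")

for obj in ["bone", "kiss"]:
    inputs = tokenizer(template.format(obj=obj), return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits[0, -1].softmax(dim=-1)
    top = probs.topk(5)  # inspect the five most likely next tokens
    print(obj, [(tokenizer.decode(int(i)), round(p.item(), 4))
                for i, p in zip(top.indices, top.values)])
```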
Thanks for the feedback!
Does this mean that it writes a projection of S1’s positional embedding to S2’s residual stream? Or is it meant to say “writing to the position [residual stream] of [S2]”? Or something else?
Our current hypothesis is that they write some information about S1’s position (what we called the “position signal”; it is not as straightforward as a projection of its positional embedding) in the residual stream of S2. (See the paragraph “Locating the position signal.” in Section 3.3.) I hope this answers your questions.
We currently think that the position signal is a relative pointer from S2 to S1, computed from the difference between the positions of S2 and S1. However, our evidence for this claim is quite limited (see the last paragraph of Appendix A).
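One cheap (and admittedly crude) way to probe the relative-pointer hypothesis would be to shift the absolute positions of the whole sentence while keeping the S1–S2 distance fixed, and check that the model’s name preference survives. The sketch below is our own hypothetical illustration of that idea, not the experiment from Appendix A:

```python
# Hypothetical probe of a *relative* position signal: prepend filler text so
# that absolute positions change while the S1-S2 distance stays fixed, then
# check that logit(IO) - logit(S) is roughly unchanged.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

base = "When Mary and John went to the store, John gave a drink to"
io_id = tokenizer.encode(" Mary")[0]
s_id = tokenizer.encode(" John")[0]

for n in [0, 5, 20]:
    prompt = "Well. " * n + base  # filler shifts every absolute position
    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
    diff = (logits[io_id] - logits[s_id]).item()
    print(f"{n:>2} filler repeats: logit(IO) - logit(S) = {diff:.3f}")
```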
That’s definitely an exciting direction for future research!
I agree with this. I think the most useful part of the concept is that it forces one to distinguish between the “superficial transformations” and the “things that stay”.
I also think that it’s useful to think about text features that are not (or are unlikely to be) gliders, like:
The tone of a memorized quote
A random date chosen to fill a blank in an administrative report
The characters in a short story that is part of a list of short stories. In general, any feature occurring before a strong context switch is unlikely to be transmitted further.