Thanks, it’s fixed!
Alexandre Variengien
Gliders in Language Models
Thanks for your comment!
1. Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I’m confused. You say “duplicating the IO token in a distractor sentence”, but I thought David would be the IO here?
Am I confused about the meaning of the IO or was there just a typo in the example?
You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizabeth”.
When using the corrected adversarial prompt, the probability of S (“Elizabeth”) increases while the probability of IO (“David”) decreases.
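If you want to check this yourself, here is a minimal sketch of such a measurement. It assumes GPT-2 small (the model studied in the post) and the Hugging Face transformers API, so treat the exact numbers as illustrative:

```python
# Minimal sketch: compare p(S) and p(IO) as next tokens on the corrected
# adversarial prompt. Assumes GPT-2 small via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = ("Then, David and Elizabeth were working at the school. "
          "David had a good day. Elizabeth decided to give a bone to")

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits
probs = logits[0, -1].softmax(dim=-1)

# Both names are single tokens in GPT-2's vocabulary when preceded by a space.
for label, name in [("IO", " David"), ("S", " Elizabeth")]:
    token_id = tokenizer.encode(name)[0]
    print(f"p({label} = {name.strip()}) = {probs[token_id].item():.4f}")
```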
Thanks a lot for spotting the typo, we corrected the post!
2. I’d love if you could expand on this (maybe with an example). It sounds like you’re implying that the circuit you found is not complete?
One way we think the circuit can differ across examples is when different semantic meanings are involved. For instance, in the example above, the object given is a “bone”, such that “a dog” could also be a plausible prediction. If “Elizabeth decided to give a kiss”, then the name of a human seems more plausible. If this is the case, then there should be additional components interacting with the circuit we described to incorporate information about the meaning of the object.
In addition to semantic meaning, there could be a different circuit for each template: different circuits could be used to handle different sentence structures.
In our study we did not investigate what differs between specific examples, as we always average experimental results over the full distribution. In this sense the circuit we found is not complete, as we cannot explain the full distribution of the model’s outputs. However, we would expect each such circuit to be a variation of the circuit we described in the paper. There are other ways in which we think our circuit is not complete; see Section 4.1 for more experiments on these issues.
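As a concrete illustration of the semantic-meaning point above, one could compare the model’s preferred continuations when only the object changes. The sketch below is a hypothetical probe of ours (again assuming GPT-2 small and the Hugging Face API), not an experiment from the paper:

```python
# Hypothetical probe: does swapping the object ("bone" vs. "kiss") change
# which continuations the model finds plausible after "to"?
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

template = ("Then, David and Elizabeth were working at the school. "
            "Elizabeth decided to give a {obj} to")

for obj in ["bone", "kiss"]:
    inputs = tokenizer(template.format(obj=obj), return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits[0, -1].softmax(dim=-1)
    top = probs.topk(5)  # inspect the five most likely next tokens
    print(obj, [(tokenizer.decode(int(i)), round(p.item(), 4))
                for i, p in zip(top.indices, top.values)])
```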
Thanks for the feedback!
Does this mean that it writes a projection of S1’s positional embedding to S2’s residual stream? Or is it meant to say “writing to the position [residual stream] of [S2]”? Or something else?
Our current hypothesis is that they write some information about S1’s position (what we called the “position signal”; it is not as straightforward as a projection of its positional embedding) in the residual stream of S2. (See the paragraph “Locating the position signal.” in Section 3.3.) I hope this answers your questions.
We currently think that the position signal is a relative pointer from S2 to S1, computed from the difference between the positions of S2 and S1. However, our evidence for this claim is quite limited (see the last paragraph of Appendix A).
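One cheap (and admittedly crude) way to probe the relative-pointer hypothesis would be to shift the absolute positions of the whole sentence while keeping the S1–S2 distance fixed, and check that the model’s name preference survives. The sketch below is our own hypothetical illustration of that idea, not the experiment from Appendix A:

```python
# Hypothetical probe of a *relative* position signal: prepend filler text so
# that absolute positions change while the S1-S2 distance stays fixed, then
# check that logit(IO) - logit(S) is roughly unchanged.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

base = "When Mary and John went to the store, John gave a drink to"
io_id = tokenizer.encode(" Mary")[0]
s_id = tokenizer.encode(" John")[0]

for n in [0, 5, 20]:
    prompt = "Well. " * n + base  # filler shifts every absolute position
    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
    diff = (logits[io_id] - logits[s_id]).item()
    print(f"{n:>2} filler repeats: logit(IO) - logit(S) = {diff:.3f}")
```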
That’s definitely an exciting direction for future research!
I agree with this. I think the most useful part of the concept is that it forces one to distinguish between the “superficial transformations” and the “things that stay”.
I also think that it’s useful to think about text features that are not (or are unlikely to be) gliders, like:
The tone of a memorized quote
A random date chosen to fill a blank in an administrative report
The characters in a short story that is part of a list of short stories. In general, any feature occurring before a strong context switch is unlikely to be transmitted further.