1.
Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I’m confused. You say “duplicating the IO token in a distractor sentence”, but I thought David would be the IO here?
Am I confused about the meaning of the IO or was there just a typo in the example?
You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizabeth”
When using the corrected adversarial prompt, the probability of S (“Elizabeth”) increases while the probability of IO (“David”) decreases.
Thanks a lot for spotting the typo, we corrected the post!
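To make the construction concrete, here is a minimal sketch of how such a prompt pair could be built. The function name `build_prompts` and its parameters are hypothetical and not taken from our code. It also assumes the prompt stops right after "to", so that the final name is what the model has to predict.

```python
def build_prompts(io_name, s_name, place="school", obj="bone"):
    """Return (original, adversarial) prompts for one IO/S name pair.

    Both prompts end just before the name the model should predict
    (an assumption made for this sketch).
    """
    intro = f"Then, {io_name} and {s_name} were working at the {place}."
    ending = f"{s_name} decided to give a {obj} to"
    original = f"{intro} {ending}"
    # The distractor sentence repeats the IO token, so the IO name
    # is no longer the only name that appears once in the context.
    adversarial = f"{intro} {io_name} had a good day. {ending}"
    return original, adversarial


original, adversarial = build_prompts("David", "Elizabeth")
print(original)
print(adversarial)
```

One would then compare the model's next-token probabilities for "David" and "Elizabeth" on the two prompts.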
2.
I’d love if you could expand on this (maybe with an example). It sounds like you’re implying that the circuit you found is not complete?
One way the circuit could differ between examples is if different semantic meanings are involved. For instance, in the example above, the object given is a “bone”, so “a dog” could also be a plausible prediction. With “Elizabeth decided to give a kiss”, the name of a human seems more plausible. If this is the case, then there should be additional components interfering with the circuit we described to incorporate information about the meaning of the object.
In addition to semantic meaning, there could be different circuits for each template: different circuits could be used to handle different sentence structures.
In our study, we did not investigate what differs between specific examples, as we always average experimental results over the full distribution. In this sense the circuit we found is not complete: we cannot explain the full distribution of the model’s outputs. However, we would expect each such circuit to be a variation of the circuit we described in the paper.
There are other ways in which we think our circuit is not complete; see Section 4.1 for more experiments on these issues.
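As a rough illustration of the difference between averaging over the full distribution and looking at each template separately, here is a small sketch. It is not the analysis from the paper: the helper `mean_by_template` is hypothetical, and which per-example score to use (for instance, an IO-minus-S logit difference) is an assumption.

```python
from collections import defaultdict
from statistics import mean


def mean_by_template(records):
    """Compute the overall mean score and the mean score per template.

    records: list of (template_id, score) pairs, one per example.
    """
    groups = defaultdict(list)
    for template_id, score in records:
        groups[template_id].append(score)
    # Averaging over everything hides how templates differ from each other.
    overall = mean(score for _, score in records)
    per_template = {t: mean(scores) for t, scores in groups.items()}
    return overall, per_template
```

If the per-template means differ a lot from the overall mean, that would hint that different templates rely on different variations of the circuit.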
Thanks for your comment!