Now replace the word the transformer should define with a real, ordinary word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer.
To make this concrete, I imagine a prompt like: “A typical definition of ‘goy’ would be: a person who is not a”, where you would expect the natural completion to be “ Jew” regardless of whether attention to “ person” is suppressed (unlike in the empty-string case). Does that correctly capture what you’re thinking of? (‘goy’ is a bit awkward here since it’s an unusual and slangy word, but I couldn’t immediately think of a better example.)
Something like ‘A Person, who is not a Librarian’ would be reasonable. Some people are librarians, and some are not.
What I do not expect to see are cases like ‘A Person, who is not a Person’ (contradictory definitions) or ‘A Person, who is not a and’ (grammatically incorrect completions).
If my prediction is wrong and it still completes with ‘A Person, who is not a Person’, that would mean it decides on that definition just by looking at the synthetic token. It would “really believe” that this token has that definition.
Got it, that makes sense. Thanks!
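For anyone wondering what "suppressing attention to a token" could look like mechanically, here is a minimal toy sketch in plain Python. It is not the setup used in the experiment discussed above: the token list and the raw scores are made up for illustration, `attention_weights` is a hypothetical helper, and a real transformer would apply this per head and per layer. The only point is that masking a key position to negative infinity before the softmax gives it zero weight and renormalizes the rest.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, suppressed=()):
    # Hypothetical helper: set suppressed key positions to -inf so they
    # receive exactly zero attention after the softmax.
    # Assumes at least one position is left unsuppressed.
    masked = [float("-inf") if i in suppressed else s
              for i, s in enumerate(scores)]
    return softmax(masked)

# Made-up tokens and raw attention scores for the final query position.
tokens = ["a", " person", " who", " is", " not", " a"]
scores = [0.1, 2.0, 0.3, 0.2, 0.5, 0.4]

baseline = attention_weights(scores)
ablated = attention_weights(scores, suppressed={tokens.index(" person")})

for tok, b, a in zip(tokens, baseline, ablated):
    print(f"{tok!r:10} baseline={b:.3f} suppressed={a:.3f}")
```

Under this toy setup, comparing the model's completion with the baseline weights against its completion with the suppressed weights is the kind of check the thread is describing: if the completion is unchanged, the prediction did not depend on attending to “ person”.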