Now replace the word the transformer is asked to define with a real, ordinary word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer.
To make this concrete, I imagine a prompt like: “A typical definition of ‘goy’ would be: a person who is not a”, where you would expect the natural completion to be “ Jew” regardless of whether attention to “ person” is suppressed (unlike in the empty-string case). Does that correctly capture what you’re thinking of? (‘goy’ is a bit awkward here since it’s an unusual and slangy word, but I couldn’t immediately think of a better example.)
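To pin down what I mean by "suppressing attention" to a token, here is a minimal toy sketch, not the setup from the post. It assumes a single attention head and a single query position, and it suppresses a key by setting its pre-softmax score to -inf so that it gets zero weight. The names `attend` and `suppress`, and all the numbers, are hypothetical and only for illustration; in a real model this would be done per head and per layer on the actual activations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, suppress=None):
    """Single-head scaled dot-product attention for one query position.

    `suppress` (hypothetical parameter) is an optional key index whose
    pre-softmax score is set to -inf, so it receives zero attention weight.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if suppress is not None:
        scores[suppress] = float("-inf")
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Made-up toy data: three context tokens with 2-dimensional keys and values.
tokens = [" a", " person", " who"]
keys = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 1.0]

base_weights, base_out = attend(query, keys, values)
ablated_weights, ablated_out = attend(
    query, keys, values, suppress=tokens.index(" person")
)
```

In this toy case the unsuppressed head puts most of its weight on " person", and after suppression that weight is redistributed over the remaining tokens. The question above is then whether the final prediction changes when you do the analogous thing in the real model.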
Got it, that makes sense. Thanks!