An interesting aspect of the heuristic the model found is that it’s wrong. That’s why it’s possible to construct adversarial examples that trick the heuristic.
I think if I’m going to accuse the model’s heuristic of being “wrong” then it’s only fair that I provide an alternative. Here’s an attempt at explaining why “Mary” is the right answer to “When Mary and John went to the store, John gave a drink to” (a rough code sketch of the same rule follows the list):
John probably gives the drink to one of the people in the context (John or Mary).
If John were the receiver, we’d usually say “John gave himself a drink”. (Probably not “John gave a drink to himself”, and never “John gave a drink to John”.)
The only person left is Mary, so Mary is probably the receiver.
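To make that alternative concrete, here’s a minimal Python sketch of the three-step rule above, assuming we already know which names appear in the context. The function, its name (`guess_receiver`), and the toy string matching are my own illustration for this post, not anything GPT2-small actually computes:

```python
# A minimal sketch of the alternative heuristic described above.
# The toy "parsing" here is a hypothetical illustration, not the
# circuit GPT2-small actually implements.

def guess_receiver(sentence: str, names: list[str]) -> str | None:
    """Guess who receives the drink in an IOI-style sentence.

    The steps mirror the prose:
      1. Candidate receivers are the people mentioned in the context.
      2. Rule out the giver, because English would normally say
         "gave himself a drink" rather than repeat the name.
      3. Whoever is left is the likely receiver.
    """
    # Step 1: candidates are the names that appear in the context.
    candidates = [n for n in names if n in sentence]

    # Crude way to find the giver: the name immediately before "gave".
    words = sentence.replace(",", "").split()
    giver = None
    for i, w in enumerate(words):
        if w == "gave" and i > 0 and words[i - 1] in candidates:
            giver = words[i - 1]

    # Step 2: rule out the giver as a receiver.
    remaining = [n for n in candidates if n != giver]

    # Step 3: if exactly one person is left, they're the predicted receiver.
    return remaining[0] if len(remaining) == 1 else None


print(guess_receiver(
    "When Mary and John went to the store, John gave a drink to",
    ["Mary", "John"],
))  # -> "Mary"
```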
Instead, the model “cheats” with a heuristic that may work quite often on the training distribution but doesn’t capture what’s actually going on, which is why it generalizes poorly to adversarial examples.
I wonder whether this wrongness just reflects the smallness of GPT2-small, or whether it shows up in larger models too. Do larger models get better performance because they find correct heuristics instead, or because they develop a more diverse set of wrong heuristics?