An interesting aspect of the heuristic the model found is that it’s wrong. That’s why it’s possible to construct adversarial examples that trick the heuristic.
I think if I’m going to accuse the model’s heuristic of being “wrong” then it’s only fair that I provide an alternative. Here’s an attempt at explaining why “Mary” is the right answer to “When Mary and John went to the store, John gave a drink to” (a rough code sketch of the same rule follows the list):
John probably gives the drink to one of the people in the context (John or Mary).
If John were the receiver, we’d usually say “John gave himself a drink”. (Probably not “John gave a drink to himself”, and never “John gave a drink to John”.)
The only person left is Mary, so Mary is probably the receiver.
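To make that alternative concrete, here’s a minimal Python sketch of the three-step rule above, assuming we already know which names appear in the context. The function, its name (`guess_receiver`), and the toy string matching are my own illustration for this post, not anything GPT2-small actually computes:

```python
# A minimal sketch of the alternative heuristic described above.
# The toy "parsing" here is a hypothetical illustration, not the
# circuit GPT2-small actually implements.

def guess_receiver(sentence: str, names: list[str]) -> str | None:
    """Guess who receives the drink in an IOI-style sentence.

    The steps mirror the prose:
      1. Candidate receivers are the people mentioned in the context.
      2. Rule out the giver, because English would normally say
         "gave himself a drink" rather than repeat the name.
      3. Whoever is left is the likely receiver.
    """
    # Step 1: candidates are the names that appear in the context.
    candidates = [n for n in names if n in sentence]

    # Crude way to find the giver: the name immediately before "gave".
    words = sentence.replace(",", "").split()
    giver = None
    for i, w in enumerate(words):
        if w == "gave" and i > 0 and words[i - 1] in candidates:
            giver = words[i - 1]

    # Step 2: rule out the giver as a receiver.
    remaining = [n for n in candidates if n != giver]

    # Step 3: if exactly one person is left, they're the predicted receiver.
    return remaining[0] if len(remaining) == 1 else None


print(guess_receiver(
    "When Mary and John went to the store, John gave a drink to",
    ["Mary", "John"],
))  # -> "Mary"
```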
Instead, the model “cheats” with a heuristic that may work quite often on the training distribution but doesn’t capture what’s actually going on, which is why it generalizes poorly to adversarial examples.
I wonder whether this wrongness just reflects the smallness of GPT2-small, or whether it shows up in larger models too. Do larger models get better performance because they find correct heuristics instead, or because they develop a more diverse set of wrong heuristics?