gwern comments on Memorizing weak examples can elicit strong behavior out of password-locked models

gwern 9 Jun 2024 20:17 UTC
LW: 4 AF: 3

It concludes that the convention is rare and that it doesn’t apply to the wider distribution of problems—because it forgets it saw the weird convention on the wider distribution?

Yes. It doesn’t remember that “it saw” the wide distribution originally (because there is no sort of meta-cognition about the training process, it’s just prediction within episode); all it knows is that it currently believes the weird convention, and that this is in conflict with the truth. You then drill it heavily on just a few memorized examples, dropping all the others. This instead concentrates all the evidence on those memorized examples being exceptions to the rule, and the others can snap back to the original distribution as the memorized examples get more parsimoniously explained. (It can’t “see” those past examples, remember. It can only see the current training example in context.) This resolves the shifting distributions efficiently. And if you kept doing this with an RNN, say, you’d be creating a serial reversal learning task: “now one is taboo. now the other is taboo. now they’re all taboo. now none are taboo.”

I wouldn’t call this ‘catastrophic forgetting’ because that would be a pretty weird use of language: it ‘forgets’ by remembering instead the true answers from even earlier in training...? Does a mouse solving a serial reversal learning task by learning to switch rapidly which part of the maze he searches engage in ‘catastrophic forgetting’, or does he just update online based on noisy observations? (Also, since this is just contradictory information, it is not possible to not ‘forget’, since the model can’t predict the taboo and the non-taboo answer simultaneously: it has to make a decision about which one to reject.)
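
To make the serial reversal picture concrete, here is a toy sketch. It is not the paper's setup or anything resembling a language model: it is a two-armed bandit with a simple delta-rule learner, chosen only for illustration, and every name and parameter (`run_serial_reversal`, `lr`, `p_reward`, `epsilon`) is hypothetical. The learner keeps no record of past blocks; it only nudges its current estimates toward whatever it just observed, and the rewarded arm flips at the end of each block.

```python
import random


def run_serial_reversal(n_blocks=4, trials_per_block=200, lr=0.2,
                        p_reward=0.9, epsilon=0.1, seed=0):
    """Toy serial reversal task (illustrative assumptions throughout).

    Returns one (rewarded_arm, preferred_arm, value_estimates) tuple per
    block, recorded at the end of that block, just before the reversal.
    """
    rng = random.Random(seed)
    values = [0.5, 0.5]   # current reward estimate for each arm
    good_arm = 0          # which arm is currently the rewarded one
    history = []

    for _ in range(n_blocks):
        for _ in range(trials_per_block):
            # Epsilon-greedy choice: mostly exploit, occasionally explore.
            if rng.random() < epsilon:
                arm = rng.randrange(2)
            else:
                arm = 0 if values[0] >= values[1] else 1

            # Noisy observation: the good arm pays off with probability
            # p_reward, the other arm with probability 1 - p_reward.
            p = p_reward if arm == good_arm else 1.0 - p_reward
            reward = 1.0 if rng.random() < p else 0.0

            # Online delta-rule update. There is no memory of earlier
            # blocks, only the current estimates and the current trial.
            values[arm] += lr * (reward - values[arm])

        preferred = 0 if values[0] >= values[1] else 1
        history.append((good_arm, preferred, list(values)))
        good_arm = 1 - good_arm  # reversal: the other arm is now rewarded

    return history


if __name__ == "__main__":
    for rewarded, preferred, vals in run_serial_reversal():
        print(f"rewarded={rewarded} preferred={preferred} "
              f"values={[round(v, 2) for v in vals]}")
```

In this sketch each reversal overwrites the previous preference, but calling that ‘forgetting’ describes nothing beyond the learner tracking the evidence it currently sees. The two arms give contradictory answers, so it cannot hold both at once.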