Agreed. We think the results probably also hold for models that were trained to imitate weak behavior, as long as the strong behavior is still “there in the latents” (you didn’t just butcher the model). Passwords just make all the experiments more convenient because they enable us to elicit this latent knowledge.
I’m not sure how your explanation works; let me try to say it in my own words: the LLM knows the truth. It is trained on a wide distribution to generate answers that follow a weird convention. It is now trained to memorize the weird convention on 2 samples. It concludes that the convention is rare and that it doesn’t apply to the wider distribution of problems—because it forgets it saw the weird convention on the wider distribution?
If it’s because it forgets, and you have no explanation of why it forgets, I’m not sure what this adds on top of the very high-level “fine-tuned models can catastrophically forget what they were fine-tuned to do (while remembering pretraining knowledge / “the truth” very well)”, which I don’t know a good explanation for.
Or is there something else where it remembers that it was trained to follow conventions on the whole distribution, but just reverts to the “true” answers just because? (I don’t know how this works.)
(Note that weak sample unlocking is a phenomenon that we observed with very weak LLMs (7B models, GPT-2 trained from scratch, toy models trained from scratch) that don’t have “internal thoughts” close to a human internal monologue, and training isn’t well described as “making an LLM think about how the world is different”. But 1. I wouldn’t be extremely surprised if there was some very shallow and simple bag of heuristics which approximates a human internal monologue well for this sort of question, and 2. I’m curious how this works in cases where LLMs are doing this sort of explicit thinking.)
It concludes that the convention is rare and that it doesn’t apply to the wider distribution of problems—because it forgets it saw the weird convention on the wider distribution?
Yes. It doesn’t remember that “it saw” the wide distribution originally (because there is no sort of meta-cognition about the training process, it’s just prediction within an episode); all it knows is that it currently predicts the weird convention, which is in conflict with the truth. You then drill it heavily on just a few memorized examples, dropping all the others. This instead concentrates all the evidence on those memorized examples being exceptions to the rule, and the others can snap back to the original distribution as the memorized examples get more parsimoniously explained. (It can’t “see” those past examples, remember. It can only see the current training example in context.) This resolves the shifting distributions efficiently. And if you kept doing this with an RNN, say, you’d be creating a serial reversal learning task: “now one is taboo. now the other is taboo. now they’re all taboo. now none are taboo.”
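The serial-reversal framing can be made concrete with a toy: an agent whose value estimates are recency-weighted (an exponential moving average, a crude stand-in for SGD’s bias toward recent gradients) handles repeated reversals by tracking each switch rather than by “forgetting” anything in a catastrophic sense. A minimal sketch; all parameters and the two-armed setup are illustrative, not from the actual experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 0.2                  # learning rate: how fast old evidence decays
eps = 0.1                    # exploration rate
Q = np.zeros(2)              # recency-weighted value estimate for each arm
block_len, n_blocks = 300, 6
correct_by_block = []

for block in range(n_blocks):
    good = block % 2         # which arm is rewarded flips every block (a reversal)
    hits = 0
    for _ in range(block_len):
        # epsilon-greedy choice between the two arms
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q))
        # the currently-good arm pays off 90% of the time, the other 10%
        r = 1.0 if rng.random() < (0.9 if a == good else 0.1) else 0.0
        Q[a] += alpha * (r - Q[a])   # exponential moving average update
        hits += (a == good)
    correct_by_block.append(hits / block_len)

print([round(x, 2) for x in correct_by_block])
```

Within each block the agent should recover to mostly choosing the newly-correct arm: the old preference is not “catastrophically forgotten”, it is washed out by recent contradictory observations, which is the same dynamic as the drilled examples dominating the gradient signal.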
I wouldn’t call this ‘catastrophic forgetting’ because that would be a pretty weird use of language: it ‘forgets’ by remembering instead the true answers from even earlier in training...? Does a mouse solving a serial reversal learning task (by learning to rapidly switch which part of the maze it searches) engage in ‘catastrophic forgetting’, or does it just update online based on noisy observations? (Also, since this is just contradictory information, it is not possible not to ‘forget’: the model can’t predict the taboo and the non-taboo answer simultaneously, so it has to make a decision about which one to reject.)
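One way to see the “it has to reject one” point: under cross-entropy, a single predictor shown contradictory labels for the same question can at best frequency-match, so whichever answer dominates the (recent) training signal necessarily crowds out the other. A minimal illustrative check in numpy; the 30/70 label split is made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Contradictory supervision on one question: 30% of labels say
# "taboo answer" (1), 70% say "true answer" (0). One logit must fit both.
labels = np.array([1.0] * 3 + [0.0] * 7)

logit = 0.0
for _ in range(5000):
    p = sigmoid(logit)
    grad = np.mean(p - labels)   # gradient of mean binary cross-entropy
    logit -= 0.5 * grad

print(sigmoid(logit))  # converges to ~0.3: it frequency-matches, it can't predict both
```

The optimum is the empirical label frequency, so shifting the mix of recent examples (as drilling on a few memorized samples does) directly shifts which answer the model commits to.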