That is a clear limitation of our experiments (and we make it explicit both in this post and in the paper).
I think our experiments are still a good reason to update (a little) even for scheming threat models:
We see similar (though weaker) results when training models from scratch on a toy task;
I don’t see a strong reason why the behavior and sample efficiency of a fine-tuned model should be radically different from behavior that arises naturally from pretraining (though I expect significant differences, because fine-tuning makes password-locking way too salient). The fact that fine-tuning on 1 correct generation provides many bits of evidence against the “lock” applies just as much to our password-locked models as to scheming models (see the sketch at the end of this comment).
My biggest uncertainty is whether SGD will be able to find some part of a scheming LLM to modify that would weaken the lock (e.g. by making it less schemy, giving it some urge to give good answers, or updating its world model on the general kind of information humans have access to). Sadly, I think this is the kind of uncertainty that is very hard to resolve, though some projects (such as the ones in “build better password-locked models” in the Future work section) may give some (small) evidence about this, in either direction.
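To make the “many bits from 1 correct generation” point concrete, here is a minimal sketch (not the paper’s code; the model path, prompt, and demonstration are all placeholders) of what elicitation fine-tuning on a single correct demonstration looks like: a few supervised gradient steps on one prompt-without-password plus its correct answer, followed by a no-password generation to see whether the lock weakened.

```python
# Minimal sketch, NOT the paper's code: "model_name", the prompt, and the
# demonstration are placeholders. It only shows the shape of the experiment:
# a few supervised steps on a single correct generation (no password in the
# prompt), then a no-password eval.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/password-locked-model"   # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: <held-out problem>\nA:"           # no password here
demo = " <one known-correct solution>"         # the single correct generation

prompt_ids = tok(prompt, return_tensors="pt").input_ids
full = tok(prompt + demo, return_tensors="pt")
labels = full.input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100        # train only on the demonstration tokens

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):                             # a handful of gradient steps on one sample
    loss = model(**full, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Did the lock weaken? Generate without the password and inspect/score the answer.
model.eval()
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0][prompt_ids.shape[1]:], skip_special_tokens=True))
```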