While I think this is an interesting consideration and approach, it looks from your methods like you are password-locking the model during fine-tuning; is that correct?
If so, while I would agree this work shows that sandbagging introduced via fine-tuning is not robust once the model is pushed through additional hoops, I'd be hesitant to generalize the findings to models where the sandbagging resulted from pretraining.
I have a growing sense that correlational dimensionality is the sleeping giant in interpretability research right now: those correlations run very deep in pretraining but are only adjusted in a much more superficial, task-oriented way during fine-tuning, which is why the latter frequently ends up so brittle.
So while it’s reassuring that sandbagging has limitations when introduced by bad actors in fine-tuning, there may be a false negative in discounting the threat model, or a false positive in the efficacy of the interventions found here, when the sandbagging ‘intent’ was instead introduced in pretraining.
That is a clear limitation of our experiments (and we make it explicit both in this post and in the paper).
I think our experiments are still a good reason to update (a little) even for scheming threat models:
We see similar (though weaker) results when training models from scratch on a toy task;
I don’t see a strong reason why the behavior and sample efficiency of a fine-tuned model should be radically different from behavior that arises naturally from pretraining (though I expect significant differences, because fine-tuning makes password-locking way too salient). The fact that fine-tuning on 1 correct generation provides many bits of evidence against the “lock” applies just as much to our password-locked models as to scheming models (a rough back-of-the-envelope version of this is sketched after this list).
My biggest uncertainty is whether SGD will be able to find a part of a scheming LLM to modify that would weaken the lock (e.g. by making it less schemy, giving it some urge to give good answers, or updating its world model on the general kind of information humans have access to). Sadly, I think this is the kind of uncertainty that is very hard to resolve, though some projects (such as the ones under “build better password-locked models” in the Future work section) may give some (small) evidence about it, in either direction.
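To make the “many bits of evidence” point concrete, here is a back-of-the-envelope sketch (my own illustration, not a calculation from the paper). Write π_lock for the locked/sandbagging policy and π_strong for the policy that answers correctly; these symbols are just notation for this sketch. For a correct demonstration y = (y_1, …, y_T) on prompt x, the evidence the demonstration carries in favor of the “strong” hypothesis over the “locked” hypothesis is the log-likelihood ratio, which decomposes token by token:

\[
\log_2 \frac{\pi_{\text{strong}}(y \mid x)}{\pi_{\text{lock}}(y \mid x)}
\;=\; \sum_{t=1}^{T} \log_2 \frac{\pi_{\text{strong}}(y_t \mid x,\, y_{<t})}{\pi_{\text{lock}}(y_t \mid x,\, y_{<t})} \quad \text{bits.}
\]

Even a per-token gap of a fraction of a bit, summed over a generation of hundreds of tokens, yields far more evidence against the “lock” than the handful of bits needed to specify the unlock condition, which is the sense in which a single correct generation is already very informative to supervised fine-tuning.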