My apologies for leaping to conclusions: I was assuming you were making an error I’ve commonly seen on LW, when you were actually saying something different (that just sounded a bit similar).
However, I think this might relate to why some of these trained models, especially the CoT and distilled-CoT versions in the larger models, may be hard for the RL credit assignment to train out: the more they consist of a set of “library calls” into preexisting circuitry for logical thinking, understanding of when to hide one’s motives, standard techniques for deceit, and so on, the harder it may be for the credit assignment to backtrack through all of that machinery (which disabling would actually damage the model’s performance, so is disfavored during RL) to the small original “library call” neural circuitry that started it down that path.

And, as you point out, this is for a model organism that has been handed a lot of what it needs on a plate during its training. If a real deceptive alignment arose, I agree it would probably (at least initially) be even more complex. And yes, it might thus need to do more work in a single forward pass, but it might also be even harder for credit assignment to backtrack through. I suspect this might be an advantage of using something like DPO, where the credit assignment is pure back-propagation and should travel through all layers, over using RL, where I’m a lot less clear how capable the credit assignment is (and the paper’s authors clearly don’t entirely trust it).
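For concreteness, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) illustrating what I mean by “pure back-propagation”: the loss is differentiable end to end through the policy’s log-probabilities, with no separate reward model, rollouts, or RL-style advantage estimation. This is not code from the paper under discussion; the tensor names are just illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is the summed log-probability of a preferred ('chosen')
    or dispreferred ('rejected') completion under the trainable policy or the
    frozen reference model."""
    # Implicit "reward" of each completion: scaled log-prob ratio vs. the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry-style preference objective: push the chosen completion's
    # implicit reward above the rejected one's. Gradients flow straight back
    # through policy_*_logps into every layer of the policy network.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Whether that back-propagated signal actually reaches and rewires the originating “library call” circuitry, rather than just the later layers, is of course exactly the open question.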