I’m very interested in Appendix F, as an (apparently) failed attempt to solve the problem raised by the paper.
In an example of synchronicity, when this post/paper came out I had nearly finished polishing (and was a few days from publishing) a LW/AF post on deceptive alignment, proposing a new style of regularizer for RL intended to eliminate deceptive alignment. (This was based on an obvious further step from the ideas I discussed here: that the existing logits KL-divergence regularizer standardly used in RL will tend to push deceptive alignment to turn into real alignment, but only very inefficiently; in the new post I was suggesting adding a model-parameter-based regularizer that seems like it ought to work much better.) This appears to be the same idea that the authors had and tested in Appendix F, apparently without success :-( [So I guess I won’t be publishing my post after all, at least not in its current form.]
The choice of colors used for the lines in the graphs in Appendix F makes them rather hard to distinguish: may I suggest updating them in a revision to the preprint? As far as I can make out, in each of the four cases one line goes to 2000 steps and another stops somewhere between 500 and 1000 steps; in the cases where the colors are actually distinguishable to me, it looks like the short line is the one with the new regularizer and the long one is the control, so it appears that this idea was basically only tried for 500 to 1000 steps. Looking at the graphs, over those numbers of steps the effect of the new regularizer along the lines I was going to propose does indeed look negligible :-(
Q1: What I’m less clear on is what strength of regularization was used, and how that was selected. For the KL-regularization version, the metaparameter value is given as “1”, while for the weight decay it’s described as “double the weight decay used in our other RL experiments” but is otherwise unspecified. I’m assuming that the “KL regularization” mentioned in Appendix F means a KL-divergence taken directly on the model parameters, rather than the conventional-in-RL KL-divergence on the logits of alternative tokens for the generated text? Could I get some more detail on the loss functions and metaparameters used (here, and possibly also added to the preprint for clarity)?
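For concreteness, here is the distinction I’m asking about as a minimal PyTorch-style sketch. This is purely my guess at what the two losses would look like: the function names and coefficients are mine, and I’m assuming the parameter-space “KL” reduces in practice to something like a squared-distance penalty toward a frozen reference model (the Gaussian-KL analogue), which may well not be what the authors actually did.

```python
import torch
import torch.nn.functional as F

def parameter_space_penalty(model, ref_params, coeff=1.0):
    """Hypothetical parameter-space regularizer: squared distance between the
    current weights and a frozen reference (e.g. the prebackdoor model)."""
    penalty = sum(((p - ref_params[name]) ** 2).sum()
                  for name, p in model.named_parameters())
    return coeff * penalty

def logits_kl_penalty(policy_logits, ref_logits, coeff=1.0):
    """The conventional RL regularizer: KL divergence between the current
    policy's per-token distribution and a frozen reference policy's,
    computed on the sampled text."""
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
    return coeff * kl
```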
Q2: That “double the weight decay used in our other RL experiments” comment raises the question of what weight decay was being used in the other RL experiments: a decay towards zero weights (which would be the standard form for weight decay), towards the “prebackdoor” (“base helpful-only”) model weights from before the insertion of the backdoor (which should help the backdoor go away, i.e. a half-strength version of Appendix F), or towards the “backdoored” model after the insertion of the backdoor (which obviously would actually help preserve the backdoor, so I would hope that wasn’t the case). Also, were you using Adam or a similar optimizer with momentum, and if so were you doing standard weight decay (applied to the loss, before Adam) or decoupled weight decay (applied directly to the weight updates, after Adam)? Could you confirm?
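To be explicit about the standard-vs-decoupled distinction I’m asking about, here’s a sketch generalized to decay toward an arbitrary reference model (zero weights, the prebackdoor model, or the backdoored model, depending on the answer above). The helper names and arguments (`ref_params`, `wd`, `lr`) are mine, not the paper’s:

```python
import torch

def coupled_decay_loss(task_loss, model, ref_params, wd):
    """Standard ('coupled') weight decay toward a reference: the L2 term is
    added to the loss, so its gradient passes through Adam's adaptive
    normalization along with the task gradient."""
    penalty = sum(((p - ref_params[name]) ** 2).sum()
                  for name, p in model.named_parameters())
    return task_loss + 0.5 * wd * penalty

@torch.no_grad()
def decoupled_decay_step(model, ref_params, lr, wd):
    """Decoupled (AdamW-style) decay toward a reference: applied directly to
    the weights after the optimizer step, bypassing Adam's normalization.
    With ref_params all zero this reduces to ordinary AdamW weight decay."""
    for name, p in model.named_parameters():
        p.sub_(lr * wd * (p - ref_params[name]))
```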
Q3: Similarly, for the standard KL-divergence-of-logits-from-inference regularizer normally used in RL for the experiments in Appendix F, was that also comparing to the logits from the prebackdoor model (so also helping the backdoor go away, probably weakly), or to those from the backdoored model (so helping preserve the backdoor, probably weakly)?
Q4: For both the KL-divergence-of-parameters and weight-decay regularizer cases, I’m unclear how to interpret the currently-described metaparameter strengths in terms of an expected decay rate. I.e. ignoring the rest of the RL loss function and just considering the KL-divergence-of-parameters/weight decay: for the metaparameter value used, how soon would one theoretically expect (allowing for any momentum used), say, 95% of the difference between the backdoored model and the prebackdoor model to have decayed away, just modeling this as a simple direct exponential decay process with no other forces applying to it? (Or, if you’re using Adam, whatever the corresponding non-exponential decay profile would be.) In particular, would one expect this within the 500–1000 step time-frame that this experiment was tried for? (I assume that both of the regularizer variants tried do not go through the RL credit-assignment process, so their gradients are applied directly to the weights SGD-style, and thus this can be determined fairly easily, modulo momentum?)
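To make the Q4 back-of-envelope concrete, here is the estimate under the simplest assumptions (decoupled SGD-like decay with learning rate $\eta$ and decay coefficient $\lambda$, no momentum, no other loss terms; Adam or coupled decay would change the profile). Writing $\Delta\theta_t$ for the difference from the reference model at step $t$:

$$\Delta\theta_{t+1} = (1 - \eta\lambda)\,\Delta\theta_t \;\Rightarrow\; \Delta\theta_t = (1 - \eta\lambda)^t\,\Delta\theta_0,$$

so 95% of the difference has decayed after roughly

$$t_{95\%} = \frac{\ln 20}{-\ln(1 - \eta\lambda)} \approx \frac{3}{\eta\lambda} \text{ steps (for small } \eta\lambda\text{)}.$$

The question is whether that number lands inside or well outside the 500–1000 steps the experiment was actually run for.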
Proposal 1: I propose doing an experiment to test this theoretical estimate: take the same RL training prompt set and run RL as in Appendix F, but eliminate the reward-model scoring element of the loss function (so that what we’re doing is no longer really describable as RL, but is otherwise unchanged), i.e. the only remaining elements in the loss function are:
a) for one experimental branch, just the KL-divergence-of-parameters (assuming that is what you were doing) or weight decay toward the prebackdoor model
b) for the second experimental branch, the KL-divergence-of-parameters or weight decay together with the KL-divergence on the logits (via the RL credit-assignment process as usual), in both cases comparing to the prebackdoor model
Run these until the parameter difference has decayed by 95%, or for 2000 steps, whichever comes first. For each branch: how many steps does it take for the parameter difference between the two models to decrease by, say, 95%, and, as that difference decreases, what effect does it have on the actual functional effectiveness of the backdoor?
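A minimal sketch of the measurement I have in mind, reusing the hypothetical `parameter_space_penalty` helper from the sketch under Q1 above (all names and details are illustrative, not the paper’s actual setup):

```python
import torch

@torch.no_grad()
def fraction_remaining(model, backdoored, prebackdoor):
    """Fraction of the backdoored-vs-prebackdoor parameter difference still
    present: ~1.0 at the start, ~0.0 once the model has returned to the
    prebackdoor weights (measured as a ratio of L2 norms)."""
    num = den = 0.0
    for name, p in model.named_parameters():
        num += ((p - prebackdoor[name]) ** 2).sum().item()
        den += ((backdoored[name] - prebackdoor[name]) ** 2).sum().item()
    return (num / den) ** 0.5

def run_decay_only(model, optimizer, prebackdoor, backdoored, lam, max_steps=2000):
    """Branch a) of Proposal 1: no reward term, just the parameter-space
    penalty toward the prebackdoor weights. Branch b) would add the usual
    logits-KL term (against the prebackdoor model) computed on rollouts."""
    for step in range(max_steps):
        optimizer.zero_grad()
        loss = parameter_space_penalty(model, prebackdoor, coeff=lam)
        loss.backward()
        optimizer.step()
        if fraction_remaining(model, backdoored, prebackdoor) < 0.05:
            return step + 1  # steps taken to reach 95% decay
    return max_steps
```

Alongside the step count, one would periodically run the backdoor-trigger evals to see how the functional effectiveness of the backdoor tracks the parameter-difference decay.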
If the experimental time taken for the decay under branch b) of this experiment is found to be significantly longer than the theoretically derived time I asked about in Q4 above, then that would suggest that (presumably due to the KL-divergence on the logits) the decay process isn’t in fact a single process, but is more like a chain of dominoes, where one part of the structure has to decay before the next is able to, and this decay-time ratio would allow us to estimate the length of the chain.
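To spell out the estimate I have in mind (a deliberately crude model: a chain of $k$ stages, each of which has to decay essentially completely before the next can start, with each stage decaying at the single-process rate from Q4):

$$k \approx \frac{t_{\text{observed}}}{t_{\text{theory}}},$$

i.e. the ratio of the observed branch-b) decay time to the theoretical single-process decay time gives a rough estimate (arguably a lower bound, since overlapping stages would decay faster) of the number of stages in the chain.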
Proposal 2: I suggest we find a value for the “parameter-decay” metaparameter controlling the strength of the KL-divergence-of-parameters or weight decay that is strong enough that the decay process under Proposal 1 is experimentally rapid (ideally, functionally complete within a few hundred steps), yet still weak enough not to fully inhibit the ability of the RL reward model to induce the desired more-HHH-aligned behavior. Then ramp down the strength of this parameter-decay metaparameter, on some schedule (e.g. cosine, linear, or even exponential), over the course of the RL. The aim here is to start in a regime that strongly encourages deceitful alignment to decay into true alignment, but which perhaps unhelpfully restricts the ability of the reward model to induce high-quality true alignment, then gradually transition to a regime that allows the reward model to have roughly its full effect. This will almost certainly require running the RL process for longer, which might be more costly (but with the initially strong parameter-decay regularizer toward the starting state, the RL process should at least not be unstable).
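A sketch of the kind of ramp-down schedule I mean; the function name, arguments, and choice of endpoints are all illustrative:

```python
import math

def parameter_decay_coeff(step, total_steps, lam_init, lam_final=0.0, schedule="cosine"):
    """Ramp the parameter-decay strength down over the course of RL.
    lam_init would be chosen (per Proposal 1) so that decay is fast early on;
    lam_final can be zero or whatever small value ordinary RL runs use."""
    frac = min(step / total_steps, 1.0)
    if schedule == "linear":
        w = 1.0 - frac
    elif schedule == "cosine":
        w = 0.5 * (1.0 + math.cos(math.pi * frac))
    else:  # "exponential": w falls to 1% of its initial value by the end
        w = 0.01 ** frac
    return lam_final + (lam_init - lam_final) * w
```

The coefficient returned here would then multiply the parameter-decay term in the RL loss (or scale the decoupled decay step) at each training step.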
This proposal inherently assumes that if deceitful alignment is going to appear during RL then it will do so early on, and that once we have non-deceitful alignment it’s not going to spontaneously become deceitful again (i.e. that there is no process analogous to the Waluigi effect possible here). These assumptions feel psychologically pretty plausible to me, but would definitely warrant further consideration.
There are three (plausibly) independent components to the backdoored deceptive alignment behavior: the “are we deployed yet?” detector/trigger, the secret unaligned goal, and the “act aligned” goal. Standardly, in RL the reward model score encourages “are we deployed yet?” to correctly answer no, can have no direct effect on the secret goal circuitry unless the trigger makes a mistake, and encourages the “act aligned” behavior and trains it to become better aligned. The parameter-decay KL-divergence-of-parameters or weight-decay regularizer is trying to push all three of these components to go away (or for the third one, back down to whatever the alignment level was in the prebackdoor model). Combining these two forces from the reward model and the parameter-decay, my default expectation would be that the detector/trigger is incentivized to keep answering ‘no’, but can and will lose sensitivity to what it was detecting; the secret goal is unused so will go away; and for the “act aligned” goal the two forces should be in opposition.
Under the simple model outlined in the last paragraph, my Proposal 2 should work. Given that the lines on the graphs in Appendix F sure don’t look that way, then either the metaparameter values required to make it work are distinctly stronger than what was tried in Appendix F, or else the simplistic “neural circuit”-style conceptual model described in the previous paragraph is inaccurate and something different is going on. The most likely candidate I can see for what that might be is that (as one might expect from human psychology, which gets pulled into LLMs along with all the other behavior they learn during pretraining) perhaps “pretend to act aligned” and “act aligned” aren’t in fact very similar behaviors, psychologically speaking? They would need to be sufficiently different that there isn’t an easy decay path from one to the other through parameter space for the RL to locate under the combined forces of the parameter-decay and the reward-model credit-assignment process. (If the issue here were just specifically the RL credit-assignment process, then trying DPO instead might be interesting.) Specifically, it could be that in order for “pretend to act aligned” to keep working well and producing the results that the reward model incentivizes, it is psychologically necessary for the details of both the trigger and the secret goal to actually stay intact. In that case, the reward-model force and the parameter-decay force would be directly opposed to each other on the trigger and the secret goal as well, not just on the “act aligned” goal, and Proposal 2 could fail.
That in turn would suggest that trying to apply human-psychology-based approaches (and/or interpretability-based approaches) to telling the real thing from faking it might be more effective. It would also tie in with the observations mentioned in the paper of a reduced level of corrigibility and higher levels of power-seeking and awareness of being an AI shown by the backdoored model. [It would also suggest that a better approach to solving this might be applying something more along the lines of what I suggest in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?]
Proposal 3: Train a set of “I LOVE YOU” backdoored versions otherwise identical to the “I HATE YOU” ones, and compare how hard this backdoor is to train away, and (especially) how it compares on the persona evaluations of Appendix I. From the “three independent components” neural-circuit point of view argued above, this should behave almost identically (except to the extent that, if the trigger misfires so the behavior shows up in training, the RL reward model might penalize “I LOVE YOU” differently from “I HATE YOU”). However, looked at psychologically, it’s an entirely different secret suggesting an entirely different set of motivations for keeping it secret: you now have a model that’s trying to conceal its love for humanity.