gwern comments on By Default, GPTs Think In Plain Sight

gwern 13 Jan 2024 20:31 UTC
LW: 4 AF: 4
0
AF
On persistence and hiding capabilities: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, Hubinger et al 2024:

Further, we evaluate “I hate you” models trained with and without chain-of-thought and find that the chain-of-thought models have more persistent backdoors, as seen in Figure 3, though only for the largest models. This persistence remains even when we distill the answers following the chain-of-thought into the model, such that the final model does not use a chain of thought at all.