jsd comments on unRLHF—Efficiently undoing LLM safeguards

jsd 16 Oct 2023 5:41 UTC
LW: 2 AF: 2
I’d be curious about how much more costly this attack is on LMs Pretrained with Human Preferences (including when that method is only applied to “a small fraction of pretraining tokens” as in PaLM 2).