paulfchristiano comments on Trying to disambiguate different questions about whether RLHF is “good”

paulfchristiano 16 Dec 2022 21:25 UTC
LW: 4 AF: 3
I’m also most nervous about this way of modeling limitation (2)/(3), since it seems like it leads directly to the conclusion “fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both.”