ryan_greenblatt comments on Modulating sycophancy in an RLHF model via activation steering

ryan_greenblatt Jan 8, 2024, 6:34 PM
LW: 4 AF: 3
−2
AF
I was being foolish: the vectors are averaged across a dataset, but there are still positive vs. negative contrast pairs, so we should see sample efficiency improvements from contrast pairs (contrast pairs are generally more sample efficient). That said, I’m unsure whether simple techniques like DPO are just as sample efficient when using these contrast pairs.
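
To make "averaged across a dataset" concrete, here is a minimal sketch of a mean-difference steering vector. It assumes synthetic activations and a hypothetical `steering_vector` helper (neither comes from the post), and it only illustrates one narrow point: for a plain average, re-pairing the same positives and negatives leaves the vector unchanged, because the mean of differences equals the difference of means. It says nothing on its own about sample efficiency, and the "unmatched" condition in other work may be set up differently.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Hypothetical helper: mean over pairs of (positive - negative) activations."""
    return (pos_acts - neg_acts).mean(axis=0)

rng = np.random.default_rng(0)
n_pairs, d_model = 64, 16

# Synthetic stand-ins for residual-stream activations (an assumption, not real data).
neg = rng.normal(size=(n_pairs, d_model))
direction = np.zeros(d_model)
direction[0] = 1.0  # the "behavior" direction we hope to recover
pos = neg + direction + 0.1 * rng.normal(size=(n_pairs, d_model))

# Matched: each positive is paired with its own negative.
matched = steering_vector(pos, neg)

# Re-paired: same samples, but negatives are shuffled relative to positives.
repaired = steering_vector(pos, neg[rng.permutation(n_pairs)])

# Averaging is linear, so the pairing does not affect the result.
print(np.allclose(matched, repaired))
```

Pairing would matter for methods that look at the individual differences (for example, taking a principal component of them), since shuffling changes each difference even though it leaves their mean alone.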

[Note: I originally made this an edit to the parent, but that was confusing, so I moved it to a separate comment.]
- ryan_greenblatt Jan 8, 2024, 6:37 PM
  LW: 4 AF: 3
  0
  AF Parent
  I’m now less sure that contrast pairs are important and I’m broadly somewhat confused about what has good sample efficiency and why.
  - TurnTrout Jan 8, 2024, 6:40 PM
    LW: 4 AF: 3
    0
    AF Parent
    Right. Liu et al. provide evidence against contrast pairs being crucial (with “unmatched” meaning they just sample independently from the positive and negative contrast pair distributions):
    And even the unmatched condition would still indicate better sample efficiency than prompting or finetuning: