Due to the results from Liu et al. noted in TurnTrout's comment here, I now don't think the action is mostly coming from contrast pairs (in at least some cases).
So activation engineering seems to be more sample-efficient than LoRA finetuning in some cases.[1]
(Though it feels to me like there should be some more principled SGD-style method which captures the juice.)
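To gesture at what such a method might look like, here is a minimal sketch (my own illustration, not from the post or from Liu et al.): instead of deriving a steering vector from contrast pairs, train a single additive vector at one residual-stream layer with SGD, keeping the model frozen. It assumes a HuggingFace-style causal LM; the model name, `layer_idx`, and the hook logic are all stand-ins.

```python
# Hypothetical sketch: SGD-train a single steering vector on a frozen LM.
# All names (model, layer_idx, toy texts) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # freeze the model; only the vector is trained

layer_idx = 6
d_model = model.config.hidden_size
steer = torch.zeros(d_model, requires_grad=True)  # the only trainable params

def add_steer(module, inputs, output):
    # Add the steering vector to the residual stream after this block.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.transformer.h[layer_idx].register_forward_hook(add_steer)

opt = torch.optim.SGD([steer], lr=1e-2)
texts = ["I love this!", "What a wonderful day."]  # tiny toy dataset
for _ in range(10):
    for t in texts:
        batch = tok(t, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        opt.zero_grad()
        loss.backward()  # gradients flow only into `steer`
        opt.step()

hook.remove()
```

The point of the sketch is the parameter count: a single d_model-sized vector at one layer, versus LoRA's pair of low-rank matrices per adapted weight. If the sample-efficiency gap is real, something this small trained directly with SGD seems like the natural candidate for capturing it.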
Up to methodological error in learning rates, etc.