ryan_greenblatt comments on Modulating sycophancy in an RLHF model via activation steering

ryan_greenblatt 2 Jan 2024 20:15 UTC
LW: 2 AF: 1
On sample efficiency and generalization more broadly, I now overall think something like:
- Using contrast pairs for variance reduction is a useful technique for improving sample efficiency. (And I was foolish to not understand this was part of the method in this post.)
- I’m unsure what the best way is to use contrast pairs to maximize sample efficiency. It seems plausible to me that it will be something like activation addition, but I could also imagine DPO or some variant of DPO working better in practice. It would be interesting to see further work comparing these methods and also trying to do ablations to understand where the advantages of the best methods come from. (That said, while this is interesting, I’m unsure how important it is to improve sample efficiency from an AI x-safety perspective.)
- I don’t think any of the generalization results I see in the linked post are very interesting as the generalizations don’t feel importantly analogous to generalizations I care about. (I think all the results are on generalization from multiple choice question answering to free response?) I’d be more excited about generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post for more discussion of the setting I’m thinking about.
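To make the variance-reduction point in the first bullet concrete, here is a toy sketch (not the method from the post, and not using a real model). It assumes each prompt contributes a large prompt-specific "nuisance" component to the activations, and that the behavior of interest shifts activations along a fixed direction. The function names, the noise scales, and the simulated setup are all hypothetical and chosen only for illustration. Differencing within a contrast pair cancels the shared nuisance term, while differencing the means of unrelated positive and negative prompts does not.

```python
import math
import random


def simulate_activations(direction, n_pairs, nuisance_std, noise_std, rng):
    """Simulate activations for n_pairs contrast pairs (toy model, an assumption).

    Both members of a pair share one prompt-specific nuisance vector; the
    positive member is shifted by +direction/2 and the negative by -direction/2.
    """
    dim = len(direction)
    pos, neg = [], []
    for _ in range(n_pairs):
        nuisance = [rng.gauss(0.0, nuisance_std) for _ in range(dim)]
        pos.append([n + 0.5 * d + rng.gauss(0.0, noise_std)
                    for n, d in zip(nuisance, direction)])
        neg.append([n - 0.5 * d + rng.gauss(0.0, noise_std)
                    for n, d in zip(nuisance, direction)])
    return pos, neg


def mean_vec(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]


def paired_estimate(pos, neg):
    """Mean of within-pair differences: the shared nuisance term cancels."""
    diffs = [[p - q for p, q in zip(a, b)] for a, b in zip(pos, neg)]
    return mean_vec(diffs)


def unpaired_estimate(pos, neg):
    """Difference of means over unrelated prompts: nuisance does not cancel."""
    mean_pos, mean_neg = mean_vec(pos), mean_vec(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def l2_error(estimate, truth):
    """Euclidean distance between the estimate and the true direction."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))


rng = random.Random(0)
direction = [1.0] + [0.0] * 7  # hypothetical "true" steering direction

# Two independent prompt sets, 16 pairs each.
pos_a, neg_a = simulate_activations(direction, 16, 5.0, 0.1, rng)
pos_b, neg_b = simulate_activations(direction, 16, 5.0, 0.1, rng)

# Paired: positives and negatives from the same prompts.
paired_err = l2_error(paired_estimate(pos_a, neg_a), direction)
# Unpaired: positives from set A, negatives from unrelated set B.
unpaired_err = l2_error(unpaired_estimate(pos_a, neg_b), direction)

print(f"paired error:   {paired_err:.3f}")
print(f"unpaired error: {unpaired_err:.3f}")
```

In this toy setup the paired estimate recovers the direction far more accurately from the same number of samples. Whether real model activations decompose this cleanly is exactly the kind of thing the ablations mentioned above would need to check.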
- ryan_greenblatt 8 Jan 2024 18:54 UTC
  LW: 4 AF: 3
  Due to the results noted in TurnTrout’s comment here from Liu et al., I now don’t think the action is mostly coming from contrast pairs (in at least some cases).
  
  So, there is higher sample efficiency for activation engineering stuff over LoRA finetuning in some cases.^[1]
  
  (Though it feels to me like there should be some more principled SGD style method which captures the juice.)
  [1] Up to methodological error in learning rates etc.