I notice that I am confused by the lack of specificity in:
not only significantly improve the model’s performance, but stack benefits with normal approaches like prompting and finetuning
and
I think alignment is a quantitative engineering problem, and that steering vectors are a tool which will improve our quantitative steering abilities
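For concreteness, here is roughly what I take "steering vector" to mean in these quotes: add a fixed direction to the residual stream at some layer during the forward pass, on top of whatever prompt you're already using. This is only an illustrative sketch (the layer index, coefficient, and model structure are made up), not the post's actual method or code.

```python
import torch

def register_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, coeff: float = 1.0):
    """Shift the layer's residual-stream output by coeff * vector."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # many HF-style blocks return (hidden_states, ...)
            return (output[0] + coeff * vector,) + output[1:]
        return output + coeff * vector
    return layer.register_forward_hook(hook)

# Hypothetical usage, stacked with ordinary prompting:
#   handle = register_steering_hook(model.transformer.h[12], steering_vector)
#   out = model.generate(**tokenizer(prompt, return_tensors="pt"))
#   handle.remove()
```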
I have some general view like “well-optimized online RLHF (which will occur by default, though it’s by no means easy) is a very strong baseline for getting average case performance which looks great to our human labelers”. So, I want to know exactly what problem you’re trying to solve. (What will quantitatively improve?)
(By well-optimized online RLHF, I mean something like: train with SFT, do RL on top of that, and continue doing RL online to ensure we get average case good labeler judgements as the distribution shifts.)
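Schematically, something like the following, where every stage function is a placeholder rather than a real API; the point is just the structure (SFT, then RL, then continued online RL with fresh labels as the distribution shifts):

```python
from typing import Any, Callable, Iterable

def well_optimized_online_rlhf(
    base_model: Any,
    sft: Callable[[Any], Any],                 # supervised finetuning stage
    rl_step: Callable[[Any, Any], Any],        # one round of RL against a reward model
    collect_labels: Callable[[Any, Any], Any], # labeler judgements on fresh samples
    fit_reward_model: Callable[[Any], Any],    # (re)fit reward model on labels so far
    deployment_batches: Iterable[Any],         # the (shifting) deployment distribution
) -> Any:
    # Stage 1: supervised finetuning.
    policy = sft(base_model)
    # Stage 2: RL against a reward model fit to labeler judgements.
    reward_model = fit_reward_model(collect_labels(policy, None))
    policy = rl_step(policy, reward_model)
    # Stage 3: keep labeling on the current distribution and keep doing RL,
    # so average-case labeler judgements stay good as the distribution shifts.
    for batch in deployment_batches:
        reward_model = fit_reward_model(collect_labels(policy, batch))
        policy = rl_step(policy, reward_model)
    return policy
```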
But there are two specific reasons why a method might be able to beat online RLHF:
Exploration issues (or failures of SGD) mean that we don’t find the best stuff. (And the other method avoids this problem, e.g. because it directly affects the model rather than requiring exploring into good stuff.)[1]
Our labelers fail to give very accurate labels.
So, to beat RLHF, you’ll need to improve one of these issues. (It’s possible you reject this frame. E.g., because you think that well-optimized online RLHF is unlikely to be run even if it works.)
One way to do so is to get a better training method (a method that maps from a model and training data to an updated model) which either has:
Better sample efficiency (which helps with exploration, because finding a smaller number of good things suffices to avoid exploration issues, and helps with labeling, because we can use a smaller amount of higher-quality labels)
“Better” “OOD generalization”[2] which helps with labeling because we can (e.g.) only label on easy cases and then generalize to hard cases. (See also here.)
Are you predicting better sample efficiency (against competitive baselines) or better OOD generalization? (Or do you reject this frame?)
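To make this question concrete, here is one hypothetical way the two claims could be operationalized; all names below are placeholders, not anyone's actual eval code. "Better sample efficiency" would mean a higher labeler-judged score at matched label budgets; "better OOD generalization" would mean a higher score on hard cases when labels were only ever collected on easy cases.

```python
from typing import Any, Callable, Sequence

def compare_sample_efficiency(
    train_candidate: Callable[[int], Any],  # method under test: label budget -> model
    train_baseline: Callable[[int], Any],   # well-optimized online RLHF: label budget -> model
    score: Callable[[Any], float],          # average-case labeler-judged score
    budgets: Sequence[int] = (1_000, 10_000, 100_000),
) -> dict:
    """Labeler-judged scores for both methods at each label budget."""
    return {n: {"candidate": score(train_candidate(n)),
                "baseline": score(train_baseline(n))}
            for n in budgets}

def easy_to_hard_generalization(
    train_on: Callable[[Sequence[Any]], Any],  # trains a model on labeled examples
    easy_labeled: Sequence[Any],               # cases labelers can label accurately
    hard_eval: Callable[[Any], float],         # score on hard cases that were never labeled
) -> float:
    """Better OOD generalization = higher hard-case score after training only on easy labels."""
    return hard_eval(train_on(easy_labeled))
```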
The most concerning cases are probably intentional sandbagging, though I currently feel pretty good about our ability to avoid this issue with current GPT-style architectures.
OOD generalization is a bit of a leaky/confusing abstraction. For instance, OOD behavior is probably a combination of sample efficiency and “true generalization”. And, with a tiny KL penalty nothing is technically fully OOD.