Or is the hope just that learning more about how neural networks work will allow us to theorize better about how to control them?
Activation vectors directly let us control models more effectively. There’s good evidence that on alignment-relevant metrics like {truthfulness, hallucination rate, sycophancy, power-seeking answers, myopia correlates}, activation vectors not only significantly improve the model’s performance, but stack benefits with normal approaches like prompting and finetuning. It’s another tool in the toolbox.
If results bear out, I think activation vectors will become best practice. (Perhaps like how KV-caching is common practice for faster inference.) I think alignment is a quantitative engineering problem, and that steering vectors are a tool which will improve our quantitative steering abilities, while falling short of “perfect control.”
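For concreteness, here is a minimal sketch of what "adding an activation vector" can look like. It is a toy NumPy illustration under stated assumptions, not the method from any particular paper: the vector is taken as the difference of mean activations over two contrasting prompt sets, and the activations are random stand-ins for a real model's residual stream. The names `steering_vector`, `steer`, `coeff`, and `d_model` are hypothetical.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, coeff=1.0):
    """Add the scaled steering vector to the activation at every token position."""
    return hidden + coeff * vector

rng = np.random.default_rng(0)
d_model = 8  # hypothetical hidden size

# Assumption: the behavior of interest lies along one direction in activation space.
direction = np.zeros(d_model)
direction[0] = 1.0

# Stand-ins for activations recorded on "positive" and "negative" prompts.
pos_acts = rng.normal(size=(16, d_model)) + 2.0 * direction
neg_acts = rng.normal(size=(16, d_model)) - 2.0 * direction
vec = steering_vector(pos_acts, neg_acts)

# Stand-in for one forward pass's activations, shape (tokens, d_model).
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, vec, coeff=4.0)
```

In a real model the addition would happen inside the forward pass at a chosen layer, and `coeff` would be tuned empirically. Nothing here says how much the downstream behavior changes; that is the empirical question.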
I notice that I am confused by the lack of specificity in:
not only significantly improve the model’s performance, but stack benefits with normal approaches like prompting and finetuning
and
I think alignment is a quantitative engineering problem, and that steering vectors are a tool which will improve our quantitative steering abilities
I have some general view like “well-optimized online RLHF (which will occur by default, though it’s by no means easy) is a very strong baseline for getting average case performance which looks great to our human labelers”. So, I want to know exactly what problem you’re trying to solve. (What will quantitatively improve?)
(By well-optimized online RLHF, I mean something like: train with SFT, do RL on top of that, and continue doing RL online to ensure we get average case good labeler judgements as the distribution shifts.)
But there are two specific reasons why a method might be able to beat online RLHF:
Exploration issues (or failures of SGD) mean that we don’t find the best stuff. (And the other method avoids this problem, e.g. because it directly affects the model rather than requiring exploration into good stuff.)[1]
Our labelers fail to give very accurate labels.
So, to beat RLHF, you’ll need to improve one of these issues. (It’s possible you reject this frame. E.g., because you think that well-optimized online RLHF is unlikely to be run even if it works.)
One way to do so is to get a better training method (a method that maps from a model and training data to an updated model) which has either:
Better sample efficiency (which helps with exploration because finding a smaller number of good things suffices to avoid exploration issues, and helps with labeling because we can use a smaller number of higher-quality labels)
“Better” “OOD generalization”[2] which helps with labeling because we can (e.g.) only label on easy cases and then generalize to hard cases. (See also here.)
Are you predicting better sample efficiency (against competitive baselines) or better OOD generalization? (Or do you reject this frame?)
[1] The most concerning cases are probably intentional sandbagging, though I currently feel pretty good at avoiding this issue with current GPT-style architecture.
[2] OOD generalization is a bit of a leaky/confusing abstraction. For instance, OOD behavior is probably a combination of sample efficiency and “true generalization”. And, with a tiny KL penalty nothing is technically fully OOD.