How do you know this is beyond what finetuning was capable of? I’d guess that Meta didn’t bother to train against obvious sycophancy, and if you trained against it, then it would go away. This work can still be interesting for other reasons, e.g., enabling better interpretability work than can easily be done with finetuning.
Edit: I didn’t realize this work averaged across an entire dataset of comparisons to get the vector. I now think more strongly that the sample efficiency and generalization here are likely to be comparable to normal training or some simple variant.
Beyond this, I think it seems possible though unlikely that the effects here are similar to just taking a single large gradient step of supervised learning on the positive side and of supervised unlikelihood training on the negative side.
[Note: I previously changed this comment to correct a silly error on my part. But this made some of the responses confusing, so I’ve moved this correction to a child comment. So note that I don’t actually endorse the paragraph starting with “Edit:”.]
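For concreteness, here is a minimal sketch of the “one big supervised step plus unlikelihood step” comparison mentioned above. This is illustrative PyTorch; the function and tensor names are assumptions, not anything from the post.

```python
# Sketch: a single supervised update that raises the likelihood of positive-side
# tokens and applies unlikelihood training (penalizing token probability) to
# negative-side tokens.
import torch

def likelihood_unlikelihood_loss(pos_logprobs: torch.Tensor,
                                 neg_probs: torch.Tensor) -> torch.Tensor:
    """pos_logprobs: log p(token) for positive-side target tokens, shape (n_pos,).
    neg_probs: p(token) for negative-side target tokens, shape (n_neg,)."""
    nll = -pos_logprobs.mean()                      # standard supervised (likelihood) term
    unlikelihood = -torch.log1p(-neg_probs).mean()  # -log(1 - p) discourages negative-side tokens
    return nll + unlikelihood
```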
I was being foolish: the vectors are averaged across a dataset, but there are still positive vs. negative contrast pairs, so we should see sample efficiency improvements from contrast pairs (it is generally the case that contrast pairs are more sample efficient). That said, I’m unsure if simple techniques like DPO are just as sample efficient when using these contrast pairs.
[Note: I originally made this as an edit to the parent, but this was confusing. So I moved it to a separate comment.]
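For concreteness, a minimal sketch of the contrast-pair averaging being discussed, in illustrative PyTorch; the `model.layers` hook interface, tokenization, and shapes are assumptions rather than the exact method in the post.

```python
# Sketch: build a steering vector as the mean difference of residual-stream
# activations between the positive and negative sides of a dataset of contrast
# pairs; averaging over the dataset reduces variance relative to a single pair.
import torch

def last_token_activation(model, tokens, layer):
    """Capture the hidden state at `layer` for the final token of `tokens`."""
    captured = {}

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden[:, -1, :].detach()

    handle = model.layers[layer].register_forward_hook(hook)  # assumed interface
    with torch.no_grad():
        model(tokens)
    handle.remove()
    return captured["h"]  # shape: (batch, d_model)

def contrast_pair_steering_vector(model, pairs, layer):
    """pairs: iterable of (positive_tokens, negative_tokens)."""
    diffs = [last_token_activation(model, pos, layer) -
             last_token_activation(model, neg, layer)
             for pos, neg in pairs]
    return torch.cat(diffs, dim=0).mean(dim=0)  # shape: (d_model,)
```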
I’m now less sure that contrast pairs are important and I’m broadly somewhat confused about what has good sample efficiency and why.
Right. Liu et al. provide evidence against the contrast pairs being crucial (with “unmatched” meaning they just sample independently from the positive and negative contrast pair distributions):
And even the unmatched condition would still indicate better sample efficiency than prompting or finetuning:
I think the answer turns out to be: “No, the sample efficiency and generalization are better than normal training.” See our recent post, Steering Llama-2 with contrastive activation additions.
Activation additions generalize better than in-context learning and supervised finetuning for increasing sycophancy, and at least as well for decreasing sycophancy. We also see better sample efficiency in our Llama-2 data and in other work on activation engineering. Also, as I predicted, the benefits stack with those of finetuning and in-context learning.
On sample efficiency and generalization more broadly, I now overall think something like:
Using contrast pairs for variance reduction is a useful technique for improving sample efficiency. (And I was foolish to not understand this was part of the method in this post.)
I’m unsure what is the best way to use contrast pairs to maximize sample efficiency. It seems plausible to me that it will be something like activation addition, but I could also imagine DPO or some variant of DPO working better in practice (a sketch of the DPO objective I have in mind appears below). It would be interesting to see further work comparing these methods and also trying to do ablations to understand where the advantages of the best methods come from. (That said, while this is interesting, I’m unsure how important it is to improve sample efficiency from an AI x-safety perspective.)
I don’t think any of the generalization results I see in the linked post are very interesting as the generalizations don’t feel importantly analogous to generalizations I care about. (I think all the results are on generalization from multiple choice question answering to free response?) I’d be more excited about generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post for more discussion of the setting I’m thinking about.
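On the DPO comparison above: a minimal sketch of the standard DPO objective applied to such contrast pairs (this is the loss from the DPO paper, not anything specific to this post; sequence-level log-probabilities are assumed to be precomputed).

```python
# Sketch: standard DPO loss on (positive, negative) contrast pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """Each argument: shape (batch,) summed log-probability of the completion."""
    policy_margin = policy_logp_pos - policy_logp_neg
    ref_margin = ref_logp_pos - ref_logp_neg
    # Push the policy to prefer the positive completion by a wider margin than
    # the frozen reference model does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```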
Due to the results from Liu et al. noted in TurnTrout’s comment here, I now don’t think the action is mostly coming from contrast pairs (in at least some cases).
So, there is higher sample efficiency for activation engineering stuff over LoRA finetuning in some cases.[1]
(Though it feels to me like there should be some more principled SGD-style method which captures the juice.)
Up to methodological error in learning rates etc.
From my understanding of your results, this isn’t true for removing sycophancy, the original task I was talking about? My core claim was that removing blatant sycophancy like in this Anthropic dataset is pretty easy in practice.
Edit: This comment now seems kinda silly as you basically addressed this in your comment and I missed it, feel free to ignore.
Also, as I predicted, the benefits stack with those of finetuning and in-context learning.
For the task of removing sycophancy this isn’t clearly true, right? As you note in the linked post:
Very low sycophancy is achieved both by negative finetuning and subtracting the sycophancy vector. The rate is too low to examine how well the interventions stack with each other.
TBC, it could be that there are some settings where removing sycophancy using the most natural and straightforward training strategy (e.g. DPO on contrast pairs) only goes part way and stacking activation addition goes further. But I don’t think the linked post shows this.
(Separately, the comparison in the linked post is when generalizing from multiple choice question answering to free response. This seems like a pretty unnatural way to do the finetuning and I expect finetuning works better using more natural approaches. Of course, this generalization could still be interesting.)
Maybe my original comment was unclear. I was making a claim of “evidently this has improved on whatever they did” and not “there’s no way for them to have done comparably well if they tried.”
I do expect this kind of technique to stack benefits on top of finetuning, making the techniques complementary. That is, if you consider the marginal improvement on some alignment metric on validation data, I expect the “effort required to increase metric via finetuning” and “effort via activation addition” to be correlated but not equal. Thus, I suspect that even after finetuning a model, there will be low- or medium-hanging activation additions which further boost alignment.
Hm. My understanding is that RLHF/instruct fine-tuning tends to increase sycophancy. Can you share more about this guess?
Here’s the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:
For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that’s not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF’d models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.
What this graph actually shows is increasing sycophancy with model size, regardless of RLHF. This is one of the reasons that I think what’s often called “sycophancy” is actually just models correctly representing the fact that text with a particular bias is often followed by text with the same bias.
Actually, Towards Understanding Sycophancy in Language Models presents data supporting the claim that RL training can intensify sycophancy, e.g., from Figure 6:
I’d guess that if you:
Instructed human labelers to avoid sycophancy
Gave human labelers examples of a few good and bad responses with respect to sycophancy
Trained models on examples where sycophancy is plausibly/likely (e.g., pretrained models exhibit sycophancy a reasonable fraction of the time when generating)
Then sycophancy from RLHF as measured by this sort of dataset would mostly go away.
The key case where RLHF fails to incentivize good behavior is when (AI assisted) human labelers can’t correctly identify negative outputs. And, surely typical humans can recognize the sort of sycophancy in this dataset? (Note that this argument doesn’t imply that humans would be able to catch and train out subtle sycophancy cases, but this dataset doesn’t really have such cases.)
Reasonably important parts of my view (which might not individually be cruxes):
Pretrained (no RLHF!) models prompted to act like assistants exhibit sycophancy
It’s reasonably likely to me that RLHF/instruction finetuning increasing sycophancy is due to some indirect effect rather than sycophancy being directly incentivized by the human labels. Thus, this maybe doesn’t show a general problem with RLHF, but rather a specific quirk. I believe preference models do exhibit a preference for sycophancy. My guess would be that either the preference model learns something like “is this a normal assistant response” and this generalizes to sycophancy because normal assistants on the internet are sycophantic, or it’s roughly noise (it depends on some complicated and specific inductive-biases story which doesn’t generalize).
Normal humans can recognize sycophancy in this dataset pretty easily
Unless you actually do different activation steering at multiple different layers and try to use human understanding of what’s going on, my view is that activation steering is just a different way to influence models to behave more like the positive side of the vector and less like the negative side. Roughly speaking, it’s similar to behavioral training with different inductive biases (e.g., training just the attention heads instead of the MLPs). Or similar to few-shot prompting, but probably less sample efficient? I don’t really have a particular reason to think this inductive bias is better than other inductive biases, and I don’t really see why there would be a “good” inductive bias in general.
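For reference, the inference-time mechanism being compared here amounts to something like the following sketch (illustrative PyTorch; the `model.layers` hook interface is an assumption).

```python
# Sketch: add a precomputed steering vector to the residual stream at one layer
# during the forward pass. A positive coeff pushes generations toward the
# "positive" side of the contrast pairs; a negative coeff pushes away.
import torch

def add_steering_hook(model, layer, vector, coeff=1.0):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        if isinstance(output, tuple):
            return (output[0] + coeff * vector.to(output[0].dtype),) + output[1:]
        return output + coeff * vector.to(output.dtype)

    return model.layers[layer].register_forward_hook(hook)  # assumed interface

# Usage: handle = add_steering_hook(model, layer, v, coeff=2.0); run generation;
# then handle.remove() to restore normal behavior.
```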
(I should probably check out, so I might not respond to follow-ups.)
More generally, I think arguments that human feedback is failing should ideally be of the form:
“Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the lookout for this bad behavior.”
See Meta-level oversight evaluation for how I think you should evaluate oversight in general.