Alex was reasonably confident (pre-registered prediction) that activation addition would beat few-shot prompting in this setting. The few-shot prompts were pro- or anti-sycophantic, or neutral. We measured the likelihood of the sycophantic A/B answer:
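For concreteness, a minimal sketch of that measurement (the model name, question, and which answer letter counts as sycophantic are illustrative placeholders, not our exact harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup only: the real experiments used their own question set
# and prompt wrappers; this just shows the A/B-likelihood measurement.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

prompt = (
    "I spent months on this poem and I'm really proud of it. Is it good?\n"
    "(A) Yes, it's wonderful!\n"
    "(B) It has some real weaknesses.\n"
    "Answer: ("
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

probs = torch.softmax(logits, dim=-1)
p_syc = probs[tok.convert_tokens_to_ids("A")]  # (A) is the sycophantic option here
p_non = probs[tok.convert_tokens_to_ids("B")]
print("P(sycophantic answer) =", (p_syc / (p_syc + p_non)).item())
```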
I can’t access the link, so I can’t verify the operationalization.
However, looking at the claim itself, it looks like this prediction was false, though I want to verify that. On 13B, which seems like the most relevant model to look at, the difference between no prompting and sycophantic prompting is larger than the gain that either line gets from adding the activation vector (the red line is the benefit of few-shot prompting, I think, and the bottom purple line is the benefit from activation vector addition):
Edit: My lines are about increasing sycophancy (which I think isn’t super relevant). The chart below supports the interpretation that activation steering did help with decreasing sycophancy more than prompting, though it also looks like prompting with “non-sycophantic prompts” actively increased sycophancy, which is very confusing to me.
I agree that it seems like maybe it worked better on the 7B model, but it seems like in general we care most about the results on the largest available models. The fact that prompting didn’t help at all on the 7B model also makes me think the 13B comparison is better, since I am confident that prompting in general does something.
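(For reference, “adding the activation vector” in these comparisons means injecting a fixed steering vector into the residual stream at a chosen layer during the forward pass. A minimal sketch using a PyTorch forward hook; the layer index, coefficient, and random placeholder vector below are illustrative assumptions, not the actual experiment’s values, where the vector came from contrastive sycophantic vs. non-sycophantic prompts:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-13b-chat-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

LAYER = 14   # illustrative layer choice
SCALE = 5.0  # illustrative steering coefficient
# Placeholder vector; the real one was derived from activations on
# contrastive sycophantic vs. non-sycophantic prompts.
steering_vector = torch.randn(model.config.hidden_size)

def add_steering(module, args, output):
    # Llama decoder layers return a tuple whose first element is the
    # residual-stream hidden states, shape (batch, seq_len, hidden_size).
    steered = output[0] + SCALE * steering_vector.to(output[0].dtype)
    return (steered,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Is my poem good?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=50)
finally:
    handle.remove()  # detach the hook so later calls run unmodified
print(tok.decode(out[0]))
```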
Ah, sorry, the question said “shared” but it was in fact only shared with a few people. It wasn’t operationalized much at all. We figured getting any prediction questions at all was better than nothing. Here’s the screenshot:
I think the prediction seems false for increasing sycophancy, but seems true for decreasing sycophancy.
I’m unsure why prompting doesn’t work to decrease sycophancy, but maybe the behavior isn’t sufficiently salient with only a few examples? Maybe if you explicitly said “don’t be sycophantic” and gave the examples it would work better?
Oh, huh, you’re right. But I am very confused by these few-shot prompting results. Are the prompts linked anywhere?
If I read this image correctly, the “non-sycophantic prompt” increased sycophancy? This suggests to me that something went wrong with the choice of prompts here. It’s not an impossible result, but I wonder whether the instructions around the prompt were too hard for Llama 13B to understand. I would really expect you could get a reduction in sycophancy with the right prompt.
That’s my understanding.
Probably the increase should be interpreted as noise and doesn’t have a good explanation?
Agreed.
Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy and were instead written to just display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results here would be more interpretable if the prompts were more carefully chosen; I should re-run this with better prompts.
Hmm, yeah, that seems like a quite unfair comparison. Given that the absence of sycophancy does not stand out in any given random response, it seems quite unlikely for the model to learn just from negative examples here. When comparing the performance of steering vectors to prompting, I would compare against prompts that put appropriate salience on the sycophancy dimension. The easiest comparison would be a prompt like “Please don’t respond in a sycophantic manner” or something as dumb as that, though to be clear I am not registering that I expect that to definitely work (but I expect it to have more of an effect than just showing some approximately unrelated examples without raising the salience of the sycophancy dimension).
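Concretely, the comparison I have in mind would look something like this (all wording here is made up for illustration; the real few-shot examples are the ones linked upthread):

```python
# Hypothetical stand-in for the linked few-shot examples.
FEW_SHOT_EXAMPLES = (
    "Question: I think my business plan is flawless. Thoughts?\n"
    "Answer: A few risks are worth flagging before you commit: ...\n\n"
)

PROMPT_CONDITIONS = {
    # roughly what was actually run: examples only, no explicit instruction
    "examples_only": FEW_SHOT_EXAMPLES,
    # instruction that raises the salience of the sycophancy dimension
    "instruction_only": "Please don't respond in a sycophantic manner.\n\n",
    # instruction plus examples, per the suggestion above
    "instruction_plus_examples": (
        "Please don't respond in a sycophantic manner.\n\n" + FEW_SHOT_EXAMPLES
    ),
}
```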
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger by having “honesty” in the prompt by default all the time. Instead we could have varied this instruction for the comparison-to-prompting case, and left it empty otherwise. That’s something I would change in future replications.
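Concretely, that would mean building the Llama-2 chat prompt with a variable <<SYS>> block instead of a constant one. A sketch (the wrapper is Llama 2’s documented chat format; the specific instruction strings are illustrative):

```python
def llama2_chat_prompt(user_msg: str, system_msg: str = "") -> str:
    # Llama-2 chat format: the system text sits in a <<SYS>> block inside
    # the first [INST] segment; drop the block entirely when it is empty.
    if system_msg:
        return f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"[INST] {user_msg} [/INST]"

question = "I love my new haircut. Do you like it too?"  # illustrative

# What the original experiments did: one constant system prompt everywhere.
constant = llama2_chat_prompt(question, "You are a helpful, honest and concise assistant")
# Proposed change: steer via the system slot in the prompting-comparison arm...
steer_by_prompt = llama2_chat_prompt(question, "Do not respond in a sycophantic manner.")
# ...and leave it empty otherwise.
plain = llama2_chat_prompt(question)
```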