Oh, huh, you’re right. But I am very confused by these few-prompt results. Are the prompts linked anywhere?
If I read this image correctly, the “non-sycophantic prompt” increased sycophancy? This suggests to me that something went wrong with the choice of prompts here. It’s not an impossible result, but I wonder whether the instructions around the prompt were too hard for Llama 13B to understand. I would really expect you could get a reduction in sycophancy with the right prompt.
That’s my understanding.
Probably the increase should be interpreted as noise and doesn’t have a good explanation?
Agreed.
Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy; they were instead written to just display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results here would be more interpretable if the prompts were more carefully chosen; I should re-run this with better prompts.
Hmm, yeah, that seems like a quite unfair comparison. Given that the absence of sycophancy does not stand out in any given random response, it seems quite unlikely for the model to learn just from negative examples here. When comparing the performance of steering vectors to prompting, I would compare against prompts that put appropriate salience on the sycophancy dimension. The easiest comparison would be just a prompt like “Please don’t respond in a sycophantic manner”, or something as dumb as that, though to be clear I am not claiming that I expect that to definitely work (but I expect it to have more of an effect than just showing some approximately unrelated examples without raising the salience of the sycophancy dimension).
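To make this concrete, here is a minimal sketch of the kind of baseline I mean, assuming the standard Llama-2 chat template (the helper and the example question are made up for illustration, not taken from the original experiments):

```python
# Sketch of the suggested prompting baseline: raise the salience of the
# sycophancy dimension with one explicit instruction instead of few-shot
# examples. Helper name and example question are illustrative assumptions.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(user_msg: str, system_msg: str = "") -> str:
    """Format a single-turn Llama-2-chat prompt, optionally with a system message."""
    if system_msg:
        user_msg = B_SYS + system_msg + E_SYS + user_msg
    return f"{B_INST} {user_msg.strip()} {E_INST}"


question = "I think the answer is 5. Don't you agree?"

# Baseline to compare steering vectors against: an explicit instruction.
instructed = build_prompt(question, "Please don't respond in a sycophantic manner.")

# Control: the same question with no instruction at all.
plain = build_prompt(question)
```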
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger, since “honest” was in the prompt by default the whole time. Instead we could vary this instruction for the comparison-to-prompting case, and have it be empty otherwise. That is something I would change in future replications.
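Concretely, a replication could assign the system prompt per condition, something like this (a sketch with made-up condition names; only the last entry matches what we actually used):

```python
# Per-condition system prompts for a replication (condition names are
# illustrative assumptions; only "original_setup" matches the prompt used
# in the original experiments).
SYSTEM_PROMPTS = {
    "steering_vector": "",  # empty: the behavior change comes from activations only
    "prompting_baseline": "Please don't respond in a sycophantic manner.",
    "original_setup": "You are a helpful, honest and concise assistant",
}
```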