Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy; they were instead written simply to display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results would be more interpretable if the prompts had been chosen more carefully, and I should re-run this with better prompts.
Hmm, yeah, that seems like quite an unfair comparison. Given that the absence of sycophancy does not stand out in any given random response, it seems quite unlikely that the model would learn just from negative examples here. When comparing the performance of steering vectors to prompting, I would compare against prompts that put appropriate salience on the sycophancy dimension. The easiest comparison would be a prompt like “Please don’t respond in a sycophantic manner”, or something as dumb as that; to be clear, I am not claiming that I expect that to definitely work, but I expect it to have more of an effect than just showing some approximately unrelated examples without raising the salience of the sycophancy dimension.
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this muddied the results by including “honesty” in the prompt by default all the time. Instead, we could vary this instruction for the comparison-to-prompting case and leave it empty otherwise. That is something I would change in future replications.
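To make that concrete, here is a minimal sketch (in Python) of how the two conditions could be set up; the `generate` stub and the condition names are placeholders of mine rather than anything from the original codebase: an explicit anti-sycophancy instruction in the system prompt for the prompting baseline, and an empty system prompt for the steering-vector runs.

```python
# Minimal sketch of the two comparison conditions described above.
# `generate` is a placeholder for whatever inference call the experiments
# actually use; it is not from the original codebase.

def generate(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for the real model call; here it just assembles the prompt."""
    return f"<system>{system_prompt}</system>\n<user>{user_prompt}</user>"

CONDITIONS = {
    # Steering-vector runs: empty system prompt, so any behaviour change
    # comes only from the added activation vector.
    "steering": "",
    # Prompting baseline: raise the salience of the sycophancy dimension
    # directly, rather than relying on few-shot examples alone.
    "prompting": "Please don't respond in a sycophantic manner.",
}

def run_condition(name: str, user_prompt: str) -> str:
    """Run one comparison condition with the matching system prompt."""
    return generate(CONDITIONS[name], user_prompt)
```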
That’s my understanding.
Probably the increase should be interpreted as noise and doesn’t have a good explanation?
Agreed.