> substantial reductions in sycophancy, beyond whatever was achieved with Meta’s finetuning
Where is this shown? Most of the results don’t evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF.
My impression is derived from looking at some apparently random qualitative examples. But maybe @NinaR can run the coeff=0 setting and report the assessed sycophancy, to settle this more quantitatively?
Here is an eval on questions designed to elicit sycophancy I just ran on layers 13-30, steering on the RLHF model. The steering vector is added to all token positions after the initial prompt/question.
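The setup described above (a steering vector added to all token positions after the initial prompt/question) can be sketched roughly as follows. This is a minimal toy illustration with hypothetical names and list-based "activations"; the real implementation modifies a transformer layer's residual stream, e.g. via a PyTorch forward hook on the chosen layer.

```python
def apply_steering(hidden_states, steering_vector, coeff, prompt_len):
    """Add coeff * steering_vector to every token position at or after
    prompt_len, leaving the initial prompt/question positions untouched.

    hidden_states:   per-token activation vectors (list of lists of floats)
    steering_vector: activation-difference vector for the target layer
    coeff:           steering coefficient (coeff=0 recovers no steering)
    prompt_len:      number of tokens in the initial prompt/question
    """
    steered = []
    for pos, h in enumerate(hidden_states):
        if pos >= prompt_len:
            # Steered position: shift the activation along the vector.
            h = [x + coeff * v for x, v in zip(h, steering_vector)]
        steered.append(list(h))
    return steered
```

Note that coeff=0 leaves the activations unchanged, which is exactly the unsteered baseline requested above.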
The no-steering point is also plotted. We can see that steering at layers 28-30 has no effect on this dataset. It is indeed correct that steering in the negative direction is much less impactful than in the positive direction. However, I think that in certain settings steering in the negative direction does help truthfulness.
I will run more evals on datasets that are easy to verify (e.g., multiple-choice questions) to gain more data on this.