Here is an eval on questions designed to elicit sycophancy, which I just ran on layers 13-30, steering the RLHF model. The steering vector is added at all token positions after the initial prompt/question.
The no-steering point is also plotted. We can see that steering at layers 28-30 has no effect on this dataset. It is also indeed the case that steering in the negative direction is much less impactful than steering in the positive direction. However, I think that in certain settings negative steering does help truthfulness.
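For reference, here is a minimal sketch of the steering mechanism described above (adding the vector at one layer, at all positions after the prompt), implemented as a PyTorch forward hook on a HuggingFace Llama-style model. This is not the exact code used for these runs; `model`, `steering_vector`, `multiplier`, and `prompt_len` are illustrative names/assumptions.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, multiplier: float, prompt_len: int):
    """Return a forward hook that adds `multiplier * steering_vector` to the
    residual stream at every token position after the initial prompt."""
    def hook(module, inputs, output):
        # Decoder blocks in HF transformers typically return a tuple whose first
        # element is the hidden states, shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        vec = multiplier * steering_vector.to(hidden.dtype).to(hidden.device)
        if hidden.shape[1] == 1:
            # Incremental decoding with a KV cache: each new token is past the prompt.
            hidden += vec
        else:
            # Full pass over the sequence: only steer positions after the prompt.
            hidden[:, prompt_len:, :] += vec
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Usage sketch (hypothetical names): steer layer 16 in the positive direction.
# handle = model.model.layers[16].register_forward_hook(
#     make_steering_hook(steering_vector, multiplier=1.0,
#                        prompt_len=inputs["input_ids"].shape[1])
# )
# outputs = model.generate(**inputs, max_new_tokens=100)
# handle.remove()  # detach the hook afterwards
```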
I will run more evals on datasets that are easy to verify (e.g., multiple-choice questions) to gather more data on this.