Here is an eval on questions designed to elicit sycophancy, which I just ran on layers 13-30, steering the RLHF model. The steering vector is added at all token positions after the initial prompt/question.
The no-steering point is also plotted. We can see that steering at layers 28-30 has no effect on this dataset. It is also indeed the case that steering in the negative direction is much less impactful than steering in the positive direction. However, I think that in certain settings negative steering does help truthfulness.
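For reference, here is a minimal sketch of the steering mechanism described above (adding the vector at one layer, at all positions after the prompt), implemented as a PyTorch forward hook on a HuggingFace Llama-style model. This is not the exact code used for these runs; `model`, `steering_vector`, `multiplier`, and `prompt_len` are illustrative names/assumptions.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, multiplier: float, prompt_len: int):
    """Return a forward hook that adds `multiplier * steering_vector` to the
    residual stream at every token position after the initial prompt."""
    def hook(module, inputs, output):
        # Decoder blocks in HF transformers typically return a tuple whose first
        # element is the hidden states, shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        vec = multiplier * steering_vector.to(hidden.dtype).to(hidden.device)
        if hidden.shape[1] == 1:
            # Incremental decoding with a KV cache: each new token is past the prompt.
            hidden += vec
        else:
            # Full pass over the sequence: only steer positions after the prompt.
            hidden[:, prompt_len:, :] += vec
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Usage sketch (hypothetical names): steer layer 16 in the positive direction.
# handle = model.model.layers[16].register_forward_hook(
#     make_steering_hook(steering_vector, multiplier=1.0,
#                        prompt_len=inputs["input_ids"].shape[1])
# )
# outputs = model.generate(**inputs, max_new_tokens=100)
# handle.remove()  # detach the hook afterwards
```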
I will run more evals on datasets that are easy to verify (e.g., multiple-choice questions) to gather more data on this.