> substantial reductions in sycophancy, beyond whatever was achieved with Meta’s finetuning
Where is this shown? Most of the results don’t evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF.
My impression is derived from looking at some apparently random qualitative examples. But maybe @NinaR can run the coeff=0 setting and report the assessed sycophancy, to settle this more quantitatively?
Here is an eval on questions designed to elicit sycophancy I just ran on layers 13-30, steering on the RLHF model. The steering vector is added to all token positions after the initial prompt/question.
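The setup described above (a steering vector added to all token positions after the initial prompt/question) can be sketched roughly as follows. This is a minimal toy illustration with hypothetical names and list-based "activations"; the real implementation modifies a transformer layer's residual stream, e.g. via a PyTorch forward hook on the chosen layer.

```python
def apply_steering(hidden_states, steering_vector, coeff, prompt_len):
    """Add coeff * steering_vector to every token position at or after
    prompt_len, leaving the initial prompt/question positions untouched.

    hidden_states:   per-token activation vectors (list of lists of floats)
    steering_vector: activation-difference vector for the target layer
    coeff:           steering coefficient (coeff=0 recovers no steering)
    prompt_len:      number of tokens in the initial prompt/question
    """
    steered = []
    for pos, h in enumerate(hidden_states):
        if pos >= prompt_len:
            # Steered position: shift the activation along the vector.
            h = [x + coeff * v for x, v in zip(h, steering_vector)]
        steered.append(list(h))
    return steered
```

Note that coeff=0 leaves the activations unchanged, which is exactly the unsteered baseline requested above.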
The no-steering point is also plotted. We can see that steering at layers 28-30 has no effect on this dataset. It is indeed correct that steering in the negative direction is much less impactful than in the positive direction. However, I think that in certain settings steering in the negative direction does help truthfulness.
I will run more evals on datasets that are easy to verify (e.g., multiple-choice questions) to gain more data on this.