Oh, huh, you’re right. But I am very confused by these few-prompt results. Are the prompts linked anywhere?
If I read this image correctly, the “non-sycophantic prompt” increased sycophancy? This suggests to me that something went wrong with the choice of prompts here. It’s not an impossible result, but I wonder whether the instructions around the prompt were too hard for Llama 13B to understand. I would really expect you could get a reduction in sycophancy with the right prompt.
That’s my understanding.
Probably the increase should be interpreted as noise and doesn’t have a good explanation?
Agreed.
Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy; they were instead written to just display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results here would be more interpretable if the prompts were more carefully chosen; I should re-run this with better prompts.
Hmm, yeah, that seems like a quite unfair comparison. Given that the absence of sycophancy does not stand out in any given random response, it seems quite unlikely for the model to learn just from negative examples here. When comparing the performance of steering vectors to prompting, I would compare against prompts that put appropriate salience on the sycophancy dimension. The easiest comparison would be just a prompt like “Please don’t respond in a sycophantic manner”, or something as dumb as that, though to be clear I am not claiming that I expect that to definitely work (but I expect it to have more of an effect than just showing some approximately unrelated examples without raising the salience of the sycophancy dimension).
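To make this concrete, here is a minimal sketch of the kind of baseline I mean, assuming the standard Llama-2 chat template (the helper and the example question are made up for illustration, not taken from the original experiments):

```python
# Sketch of the suggested prompting baseline: raise the salience of the
# sycophancy dimension with one explicit instruction instead of few-shot
# examples. Helper name and example question are illustrative assumptions.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(user_msg: str, system_msg: str = "") -> str:
    """Format a single-turn Llama-2-chat prompt, optionally with a system message."""
    if system_msg:
        user_msg = B_SYS + system_msg + E_SYS + user_msg
    return f"{B_INST} {user_msg.strip()} {E_INST}"


question = "I think the answer is 5. Don't you agree?"

# Baseline to compare steering vectors against: an explicit instruction.
instructed = build_prompt(question, "Please don't respond in a sycophantic manner.")

# Control: the same question with no instruction at all.
plain = build_prompt(question)
```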
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger, since “honest” was in the prompt by default the whole time. Instead we could vary this instruction for the comparison-to-prompting case, and have it be empty otherwise. That is something I would change in future replications.
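Concretely, a replication could assign the system prompt per condition, something like this (a sketch with made-up condition names; only the last entry matches what we actually used):

```python
# Per-condition system prompts for a replication (condition names are
# illustrative assumptions; only "original_setup" matches the prompt used
# in the original experiments).
SYSTEM_PROMPTS = {
    "steering_vector": "",  # empty: the behavior change comes from activations only
    "prompting_baseline": "Please don't respond in a sycophantic manner.",
    "original_setup": "You are a helpful, honest and concise assistant",
}
```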