Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy; they were instead written simply to display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results would be more interpretable if the prompts had been chosen more carefully, and I should re-run this with better prompts.
Hmm, yeah, that seems like quite an unfair comparison. Given that the absence of sycophancy does not stand out in any given random response, it seems quite unlikely that the model would learn just from negative examples here. When comparing the performance of steering vectors to prompting, I would compare against prompts that put appropriate salience on the sycophancy dimension. The easiest comparison would be a prompt like “Please don’t respond in a sycophantic manner”, or something as dumb as that; to be clear, I am not claiming that I expect that to definitely work, but I expect it to have more of an effect than just showing some approximately unrelated examples without raising the salience of the sycophancy dimension.
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this muddied the results by including “honesty” in the prompt by default all the time. Instead, we could vary this instruction for the comparison-to-prompting case and leave it empty otherwise. That is something I would change in future replications.
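To make that concrete, here is a minimal sketch (in Python) of how the two conditions could be set up; the `generate` stub and the condition names are placeholders of mine rather than anything from the original codebase: an explicit anti-sycophancy instruction in the system prompt for the prompting baseline, and an empty system prompt for the steering-vector runs.

```python
# Minimal sketch of the two comparison conditions described above.
# `generate` is a placeholder for whatever inference call the experiments
# actually use; it is not from the original codebase.

def generate(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for the real model call; here it just assembles the prompt."""
    return f"<system>{system_prompt}</system>\n<user>{user_prompt}</user>"

CONDITIONS = {
    # Steering-vector runs: empty system prompt, so any behaviour change
    # comes only from the added activation vector.
    "steering": "",
    # Prompting baseline: raise the salience of the sycophancy dimension
    # directly, rather than relying on few-shot examples alone.
    "prompting": "Please don't respond in a sycophantic manner.",
}

def run_condition(name: str, user_prompt: str) -> str:
    """Run one comparison condition with the matching system prompt."""
    return generate(CONDITIONS[name], user_prompt)
```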
That’s my understanding.
Probably the increase should be interpreted as noise and doesn’t have a good explanation?
Agreed.