danielbalsam comments on Refusal in LLMs is mediated by a single direction

danielbalsam 3 May 2024 5:03 UTC
0 points
0
Great post, thanks for sharing. I am trying to replicate this work and was able to do so for several models, but I am having a lot of trouble reproducing it for the Llama 3 models. I can sometimes succeed on some narrow prompts but not others. Do you have any suggestions, or is there anything else non-obvious about that model family?
- Andy Arditi 6 May 2024 9:45 UTC
  2 points
  0
  The most finicky part of our methodology (and the part I’m least satisfied with currently) is the selection of a direction.
  For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
  - 8B: (position_idx = -1, layer_idx = 12)
  - 70B: (position_idx = -5, layer_idx = 37)
  The position indexing assumes the usage of this prompt template, with two newlines appended to the end.
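
For anyone trying to use the (position_idx, layer_idx) pairs above, here is a minimal sketch of how they might index into cached activations. It is not the authors' code. It assumes a difference-of-means direction between harmful and harmless prompts, a hypothetical activation array of shape (n_prompts, n_layers, seq_len, d_model), and prompts aligned so that negative positions count back from the final token of the templated prompt. The function name `select_direction`, the `LLAMA3_SETTINGS` dict, and the random stand-in data are all illustrative, and whether `layer_idx` is 0-based in the original setup is not stated in the reply.

```python
import numpy as np

# (position_idx, layer_idx) pairs quoted in the reply above.
LLAMA3_SETTINGS = {"8B": (-1, 12), "70B": (-5, 37)}


def select_direction(harmful_acts, harmless_acts, position_idx, layer_idx):
    """Unit-norm difference-of-means direction at one (position, layer).

    Both inputs are assumed to have shape
    (n_prompts, n_layers, seq_len, d_model); this layout is hypothetical.
    """
    mean_harmful = harmful_acts[:, layer_idx, position_idx, :].mean(axis=0)
    mean_harmless = harmless_acts[:, layer_idx, position_idx, :].mean(axis=0)
    direction = mean_harmful - mean_harmless
    return direction / np.linalg.norm(direction)


# Random stand-in activations; real ones would come from the model.
rng = np.random.default_rng(0)
n_prompts, n_layers, seq_len, d_model = 32, 16, 8, 4
harmless = rng.normal(size=(n_prompts, n_layers, seq_len, d_model))
harmful = rng.normal(size=(n_prompts, n_layers, seq_len, d_model))

# Plant a synthetic signal along dimension 0 at the 8B position and layer.
position_idx, layer_idx = LLAMA3_SETTINGS["8B"]
harmful[:, layer_idx, position_idx, 0] += 5.0

direction = select_direction(harmful, harmless, position_idx, layer_idx)
```

Because the positions are negative offsets from the end of the sequence, the trailing characters of the prompt template (including the two appended newlines) change which token each index refers to, which is one plausible reason a replication could be sensitive to template details.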