Andy Arditi comments on Refusal in LLMs is mediated by a single direction

Andy Arditi 6 May 2024 9:45 UTC
2 points
The most finicky part of our methodology (and the part I’m least satisfied with currently) is the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
- 8B: (position_idx = -1, layer_idx = 12)
- 70B: (position_idx = -5, layer_idx = 37)
The position indexing assumes the use of this prompt template, with two newlines appended to the end.
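
To make the role of a (position_idx, layer_idx) pair concrete, here is a minimal sketch of extracting one candidate direction. It assumes the direction is the normalized difference between mean activations on harmful and harmless prompts, which the comment itself does not spell out. The function names, the nested-list layout `acts[prompt][layer][position]`, and the toy data are all hypothetical, chosen only for illustration. Negative position indices count back from the end of the templated prompt, so -1 is its last token.

```python
from math import sqrt

def mean_activation(acts, layer_idx, position_idx):
    """Average the activation vector at (layer_idx, position_idx) over prompts.

    Assumed layout (hypothetical): acts[prompt][layer][position] is a list of floats.
    """
    vectors = [prompt[layer_idx][position_idx] for prompt in acts]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def candidate_direction(harmful_acts, harmless_acts, layer_idx, position_idx):
    """Unit-norm difference of means (assumed extraction method)."""
    mu_harmful = mean_activation(harmful_acts, layer_idx, position_idx)
    mu_harmless = mean_activation(harmless_acts, layer_idx, position_idx)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def toy_prompt(vec):
    """Toy activations: 2 layers, 3 positions, 2 dimensions, all zero
    except the last position of layer 1, which is set to vec."""
    acts = [[[0.0, 0.0] for _ in range(3)] for _ in range(2)]
    acts[1][-1] = vec
    return acts

harmful = [toy_prompt([3.0, 1.0]), toy_prompt([5.0, 1.0])]
harmless = [toy_prompt([1.0, 1.0]), toy_prompt([1.0, 1.0])]

# position_idx=-1 selects the final token position of each prompt.
direction = candidate_direction(harmful, harmless, layer_idx=1, position_idx=-1)
print(direction)  # [1.0, 0.0]
```

With real models the activations would come from the residual stream at the stated layer, and the position index only lines up if every prompt ends with the same template suffix, which is why the trailing newlines matter.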