Great post, thanks for sharing. I am trying to replicate this work and was able to do so for several models, but I am having a lot of trouble reproducing it for the Llama 3 models. I sometimes succeed on some narrow prompts but not on others. Do you have any suggestions, or is there anything non-obvious about that model family?
The most finicky part of our methodology (and the part I’m least satisfied with currently) is in the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers from which we extracted the directions:
8B: (position_idx = −1, layer_idx = 12)
70B: (position_idx = −5, layer_idx = 37)
The position indexing assumes the use of this prompt template, with two newlines appended to the end.
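For anyone trying to reproduce this, here is a minimal sketch of how a (position_idx, layer_idx) pair like the ones above might be used to pick out activations. It is an illustration under assumptions, not our actual code: the nested-list layout `acts[layer][position]`, the helper names, and the difference-of-means step are all hypothetical choices made for this example. The negative position index counts back from the end of the templated prompt, so -1 is the last token and -5 is the fifth from last.

```python
from statistics import fmean

# Settings quoted in the comment above.
LLAMA3_SETTINGS = {
    "8B": {"position_idx": -1, "layer_idx": 12},
    "70B": {"position_idx": -5, "layer_idx": 37},
}

def pick_activation(acts, position_idx, layer_idx):
    """Return one activation vector for one prompt.

    Hypothetical layout: acts[layer][position] is a list of floats.
    A negative position_idx counts from the end of the prompt.
    """
    return acts[layer_idx][position_idx]

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]

def candidate_direction(acts_a, acts_b, position_idx, layer_idx):
    """Difference of mean activations between two prompt sets.

    Assumption for illustration: the direction is a difference of means
    taken at a single (position, layer). acts_a and acts_b each hold one
    acts structure per prompt.
    """
    mean_a = mean_vector([pick_activation(a, position_idx, layer_idx) for a in acts_a])
    mean_b = mean_vector([pick_activation(a, position_idx, layer_idx) for a in acts_b])
    return [x - y for x, y in zip(mean_a, mean_b)]
```

Because the index is relative to the end of the prompt, any change to the template tail (for example, dropping the two trailing newlines) shifts which token the same position_idx refers to, which is one thing to check if results do not match.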