Nina Panickssery comments on Steering Llama-2 with contrastive activation additions

Nina Panickssery 4 Jan 2024 6:52 UTC
LW: 10 AF: 5
0
AF
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger by using “honesty” in the prompt by default all the time. Instead we could vary this instruction for the comparison to prompting case, and have it empty otherwise. Something I would change in future replications I do.