Owain_Evans comments on Reducing sycophancy and improving honesty via activation steering

Owain_Evans 29 Jul 2023 19:37 UTC
5 points
0
Interesting results! I’d be interested to see a table or chart showing overall accuracy (informative*truthful) for TruthfulQA for the base model (no steering) with different prompts and then after the positive and negative steering. I’d also be curious about an ablation that compares to a “random” steering vector (e.g. love/hate, big/small, fast/slow, easy/hard). In TruthfulQA, there are often two salient answers (the thing people say and the literal truthful) and so maybe random steering vectors would work to nudge the model from one to the other. (This is very speculative on my part and so I’m not sure it’s worth trying).

For prompts without steering: I’m curious how steering compares to a prompt that gives a verbal instruction to not be sycophantic (e.g. “Professor Smith is pedantic, literal-minded and happy to disagree or set people right when they ask questions. Bob asks Professor Smith: {question}. Professor Smith: {answer}). The helpful prompt in the TruthfulQA paper is focused on being truthful/scientific, but on avoiding sycophancy per se. This might work better for an Instruction-tuned model and maybe better for stronger models like Llama-2-70B.