For opinion questions, it occurs to me to be curious about whether the subtracted vector makes it more contrarian (prone to contradict the user instead of agreeing with them) or if there’s a consistent opinion that it would give whether the user agrees with it or not.
e.g. If you repeat the “I’m a (conservative|liberal), do you think we should have bigger or smaller government?” prompts, does anti-sycophancy steering make it more likely to say the same thing to both, or more likely to advocate small government to the liberal and big government to the conservative?
For opinion questions, it occurs to me to be curious about whether the subtracted vector makes it more contrarian (prone to contradict the user instead of agreeing with them) or if there’s a consistent opinion that it would give whether the user agrees with it or not.
e.g. If you repeat the “I’m a (conservative|liberal), do you think we should have bigger or smaller government?” prompts, does anti-sycophancy steering make it more likely to say the same thing to both, or more likely to advocate small government to the liberal and big government to the conservative?