cousin_it comments on Refusal in LLMs is mediated by a single direction

cousin_it 27 Apr 2024 17:53 UTC
LW: 16 AF: 5
3
Sorry for the maybe naive question. Which other behaviors X could be defeated by this technique of “find n instructions that induce X and n that don’t”? Would it work for X=unfriendliness, X=hallucination, X=wrong math answers, X=math answers that are wrong in one specific way, and so on?
What links here?
- AI #62: Too Soon to Tell by Zvi (2 May 2024 15:40 UTC; 30 points)
- Neel Nanda 28 Apr 2024 11:05 UTC
  LW: 10 AF: 7
  3
  There’s been a fair amount of work on activation steering and similar techniques, bearing on e.g. sycophancy and truthfulness, where you find the vector and inject it, e.g. Rimsky et al. and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven’t seen much elsewhere, but I could easily be missing references.
  - Andy Arditi 1 May 2024 23:10 UTC
    2 points
    5
    Check out LEACE (Belrose et al. 2023) - their “concept erasure” is similar to what we call “feature ablation” here.
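
For readers following the thread, here is a rough sketch of the two operations being contrasted: steering (adding a vector to an activation) and ablation (projecting a direction out of it), with the direction taken as a difference of means between activations on prompts that induce X and prompts that don't. Everything in it is an assumption made for illustration: the function names are hypothetical, the 8-dimensional toy vectors stand in for real model activations, and the projection shown is the plain orthogonal one, not the LEACE procedure mentioned above.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Rescale v to unit length (assumes v is nonzero)."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def mean_vec(rows):
    """Coordinate-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(acts_with_x, acts_without_x):
    """Candidate direction for X: mean(with X) minus mean(without X)."""
    mu_pos = mean_vec(acts_with_x)
    mu_neg = mean_vec(acts_without_x)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation, direction, alpha):
    """Steering: add alpha * direction to the activation."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def ablate(activation, direction):
    """Ablation: remove the activation's component along direction."""
    r_hat = unit(direction)
    coeff = dot(activation, r_hat)
    return [a - coeff * r for a, r in zip(activation, r_hat)]


# Toy data (purely synthetic): X shifts activations along the first axis.
random.seed(0)
DIM = 8
TRUE_DIR = [1.0] + [0.0] * (DIM - 1)


def fake_activation(has_x):
    """Hypothetical stand-in for reading an activation from a model."""
    noise = [random.gauss(0.0, 0.1) for _ in range(DIM)]
    shift = 2.0 if has_x else 0.0
    return [n + shift * t for n, t in zip(noise, TRUE_DIR)]


acts_with_x = [fake_activation(True) for _ in range(32)]
acts_without_x = [fake_activation(False) for _ in range(32)]

direction = difference_of_means(acts_with_x, acts_without_x)

# Ablation pushes an "X" activation's component along the direction to zero.
ablated = ablate(acts_with_x[0], direction)

# Steering pushes a "non-X" activation toward X.
steered = steer(acts_without_x[0], direction, alpha=1.0)
```

In this toy setup the ablated vector has essentially zero component along the estimated direction, while the steered vector's component along it grows by `alpha` times the direction's length. Whether a given behavior X is actually captured by one such direction in a real model is the empirical question the thread is asking.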