Neel Nanda comments on Refusal in LLMs is mediated by a single direction

Neel Nanda 28 Apr 2024 11:05 UTC
LW: 10 AF: 7
There’s been a fair amount of work on activation steering and similar techniques, with bearing on e.g. sycophancy and truthfulness, where you find the vector and inject it (e.g. Rimsky et al. and Zou et al.). It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven’t seen much elsewhere, but I could easily be missing references.
- Andy Arditi 1 May 2024 23:10 UTC
  2 points
  Check out LEACE (Belrose et al. 2023) - their “concept erasure” is similar to what we call “feature ablation” here.
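
To make the distinction in this thread concrete, here is a minimal NumPy sketch contrasting steering (adding a vector to activations) with ablation (projecting a direction out of them). It runs on toy random data; the names `steer` and `ablate`, the scale `alpha`, and the array shapes are illustrative assumptions, not code from the post or the cited papers, and LEACE's own erasure procedure is not implemented here.

```python
import numpy as np

def steer(acts, direction, alpha):
    """Activation steering: add a scaled copy of the direction to every activation."""
    return acts + alpha * direction

def ablate(acts, direction):
    """Directional ablation: remove each activation's component along the direction."""
    unit = direction / np.linalg.norm(direction)
    # (acts @ unit) is the per-row coefficient along the direction;
    # the outer product rebuilds that component so it can be subtracted.
    return acts - np.outer(acts @ unit, unit)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))    # toy stand-in for activations: 4 positions, 8 dims
direction = rng.normal(size=8)    # toy stand-in for a concept direction
unit = direction / np.linalg.norm(direction)

steered = steer(acts, direction, alpha=2.0)
ablated = ablate(acts, direction)

# After ablation, no activation has any component left along the direction.
print(np.allclose(ablated @ unit, 0.0))
```

Steering shifts every activation by the same offset, so the result depends on the chosen `alpha`. Ablation has no such free parameter: it zeroes the component along the direction and leaves the orthogonal part of each activation unchanged.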