Buck comments on Refusal in LLMs is mediated by a single direction

Buck 29 Apr 2024 14:55 UTC
LW: 6 AF: 6
I’m pretty skeptical that this technique is what you’d end up using if you approached the problem of removing refusal behavior technique-agnostically, e.g. by also carefully tuning a fine-tuning setup and then picking whichever technique works best.
- TurnTrout 2 May 2024 16:18 UTC
  LW: 12 AF: 9
  Because fine-tuning can be a pain and expensive? But you can probably do this quite quickly and painlessly.
  If you want to say fine-tuning is better than this, or (more relevantly) fine-tuning + this, can you provide some evidence?
- Neel Nanda 29 Apr 2024 18:33 UTC
  LW: 5 AF: 3
  I don’t think we really engaged with that question in this post, so the following is fairly speculative. But I think there are some situations where this would be a superior technique, mostly low-resource settings where doing a backward pass is prohibitive for memory reasons, or where the compute budget is very tight. But yeah, this isn’t a load-bearing claim for me: I still count it as a partial victory to find a novel technique that’s a bit worse than fine-tuning, and I think this is significantly better than prior interp work. It seems reasonable to disagree, though, and say you need to be better or bust.