Thanks! I'm personally skeptical of ablating a separate direction per block; it feels less surgical than ablating a single direction everywhere, and we show that a single direction works fine for Llama 3 8B and 70B.
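(For concreteness, here is a minimal sketch of the two options being compared; refusal_dir and refusal_dirs are placeholder tensors of shape (d_model,) and (n_layers, d_model), not the post's actual values.)

```python
import torch

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `resid` along `direction` (directional ablation)."""
    d_hat = direction / direction.norm()
    return resid - (resid @ d_hat).unsqueeze(-1) * d_hat

# Option 1: a single direction everywhere
#   resid = ablate_direction(resid, refusal_dir)          # same vector at every layer
# Option 2: a separate direction per block
#   resid = ablate_direction(resid, refusal_dirs[layer])  # a different vector per layer
```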
The TransformerLens library does not have a save feature :(
Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
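(A minimal sketch of what that looks like, using a small model for illustration; the filename is a placeholder.)

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small model just for illustration
# ... edit the weights in place ...

# HookedTransformer is an ordinary nn.Module, so the usual PyTorch save/load works.
torch.save(model.state_dict(), "edited_model.pt")  # placeholder filename

# Later: construct the same architecture again, then load the edited weights back in.
model.load_state_dict(torch.load("edited_model.pt"))
```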
Agreed, it seems less elegant. But one person on Hugging Face did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
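(For anyone who wants to reproduce that kind of comparison, a rough sketch using cosine similarity; `directions` is a placeholder (n_layers, d_model) tensor of per-layer candidate refusal directions.)

```python
import torch

def direction_similarity_matrix(directions: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the directions extracted at each pair of layers."""
    unit = directions / directions.norm(dim=-1, keepdim=True)
    return unit @ unit.T  # (n_layers, n_layers)

# A block of high off-diagonal similarity across the middle layers would suggest the
# underlying feature direction is roughly shared, even if early/late layers disagree.
```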
omg, I totally missed that, thanks. Let me know if I missed anything else; I just want to learn.
The older versions of the gist use TransformerLens, if anyone wants those. In those versions the interventions work better, since you can target resid_pre, resid_mid, etc.
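(For reference, those intervention points are just hook names in TransformerLens; a rough sketch of targeting them, with a random placeholder direction and a small model for illustration.)

```python
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")  # small model just for illustration
refusal_dir = torch.randn(model.cfg.d_model)       # placeholder direction
refusal_dir = refusal_dir / refusal_dir.norm()

def ablation_hook(resid, hook):
    # Project the direction out of the residual stream at this hook point.
    return resid - (resid @ refusal_dir).unsqueeze(-1) * refusal_dir

fwd_hooks = []
for layer in range(model.cfg.n_layers):
    fwd_hooks.append((utils.get_act_name("resid_pre", layer), ablation_hook))
    fwd_hooks.append((utils.get_act_name("resid_mid", layer), ablation_hook))

logits = model.run_with_hooks(model.to_tokens("Hello there"), fwd_hooks=fwd_hooks)
```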
Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But really you want to balance removing refusal against not damaging the model, and many layers are probably just irrelevant for refusal. Though arguably this suggests we're both wrong, and the most surgical intervention is deleting the direction from key layers only.
The direction extracted by this method will vary per layer, but that doesn't mean the underlying feature direction varies that much; rather, it means the direction cannot be extracted as a linear function of the activations at layers that are too early or too late.
For this model, we found that activations at the last token position (assuming this prompt template, with two newlines appended to the end) at layer 12 worked well.
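(To make the extraction step concrete, a sketch of a difference-in-means computation at that position and layer; the function, the choice of resid_pre, and the variable names are illustrative rather than the exact notebook code.)

```python
import torch
import transformer_lens.utils as utils

LAYER = 12  # layer that worked well for this model
POS = -1    # last token of the formatted prompt

def get_refusal_direction(model, harmful_tokens, harmless_tokens):
    """Difference-in-means direction at one layer/position (illustrative sketch).

    Assumes both token batches use the same prompt template and length, so that
    position -1 lines up across prompts.
    """
    _, harmful_cache = model.run_with_cache(harmful_tokens)
    _, harmless_cache = model.run_with_cache(harmless_tokens)
    act_name = utils.get_act_name("resid_pre", LAYER)
    harmful_mean = harmful_cache[act_name][:, POS, :].mean(dim=0)
    harmless_mean = harmless_cache[act_name][:, POS, :].mean(dim=0)
    direction = harmful_mean - harmless_mean
    return direction / direction.norm()
```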