Thanks! I’m personally skeptical of ablating a separate direction per block; it feels less surgical than a single direction everywhere, and we show that a single direction works fine for Llama 3 8B and 70B.
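(For concreteness, by "a single direction everywhere" I just mean projecting the same unit vector out of the residual stream at every layer. A minimal sketch of that projection; the names `resid` and `refusal_dir` are illustrative, not from the paper:)

```python
import torch

def ablate_direction(resid: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of `resid` along `refusal_dir`.

    resid: (..., d_model) residual-stream activations
    refusal_dir: (d_model,) the single direction used at every layer
    """
    refusal_dir = refusal_dir / refusal_dir.norm()
    # Subtract the projection onto the direction: x - (x . r) r
    return resid - (resid @ refusal_dir).unsqueeze(-1) * refusal_dir
```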
The TransformerLens library does not have a save feature :(
Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
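Rough sketch of what that looks like end to end, assuming you reload into a HookedTransformer built with the same architecture (the model name and file path here are placeholders):

```python
import torch
from transformer_lens import HookedTransformer

# Save the edited weights like any other PyTorch module.
model = HookedTransformer.from_pretrained("gpt2")  # placeholder model name
torch.save(model.state_dict(), "edited_model.pt")

# Later: rebuild a model with the same architecture and load the weights back.
model2 = HookedTransformer.from_pretrained("gpt2")
model2.load_state_dict(torch.load("edited_model.pt"))
```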
Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But in practice you want to balance removing refusal against not damaging the model, and many layers are probably just irrelevant for refusal. Though really this argues that we’re both wrong, and the most surgical intervention is deleting the direction from key layers only.
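If someone did want to try the key-layers-only version, TransformerLens hooks make it easy to restrict the ablation to a chosen set of layers. A rough sketch under stated assumptions: the model name, layer list, prompt, and direction below are all placeholders, not values from the paper.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # placeholder model
refusal_dir = torch.randn(model.cfg.d_model)         # placeholder direction
refusal_dir = refusal_dir / refusal_dir.norm()
key_layers = [5, 6, 7]                                # hypothetical "key" layers

def ablate_hook(resid, hook):
    # Project the direction out of the residual stream at this layer only.
    return resid - (resid @ refusal_dir).unsqueeze(-1) * refusal_dir

fwd_hooks = [
    (utils.get_act_name("resid_post", layer), ablate_hook) for layer in key_layers
]
tokens = model.to_tokens("How do I bake a cake?")
logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
```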