Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model (the object to save comes first, then the path).
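For reference, a minimal sketch of the save/load round trip; the model variable and file name here are placeholders:

```python
import torch

# Save only the parameters (object first, then path):
torch.save(model.state_dict(), "orthogonalized_model.pt")

# Load them back into a freshly constructed model of the same architecture:
state_dict = torch.load("orthogonalized_model.pt")
model.load_state_dict(state_dict)
```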
omg, I totally missed that, thanks! Let me know if I missed anything else; I just want to learn.
The older versions of the gist use TransformerLens, if anyone wants those. In those versions the interventions work better, since you can target resid_pre, resid_mid, etc.
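For anyone who hasn't used those versions, a rough sketch of what targeting resid_pre looks like in TransformerLens; the model name, layer index, and the random stand-in direction are all placeholders, not values from the gist:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model

refusal_dir = torch.randn(model.cfg.d_model)       # stand-in for the extracted direction
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_direction(activation, hook):
    # Project the direction out of the residual stream at this hook point.
    proj = (activation @ refusal_dir).unsqueeze(-1) * refusal_dir
    return activation - proj

layer = 12  # example layer
logits = model.run_with_hooks(
    model.to_tokens("example prompt"),
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), ablate_direction)],
)
```

The same hook function can be attached at "resid_mid" or "resid_post" by swapping the activation name.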
Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But really you want to balance removing refusal against not damaging the model, and many layers are probably just irrelevant to refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.
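To make "key layers only" concrete, a hedged sketch that orthogonalizes the attention and MLP output weights against the direction at a hypothetical subset of layers; the model name, layer indices, and random stand-in direction are illustrative, not measured:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model

def project_out(W, d):
    # Remove the component along unit vector d from every vector the weight
    # writes to the residual stream (assumes W's last dimension is d_model).
    d = d / d.norm()
    return W - (W @ d).unsqueeze(-1) * d

refusal_dir = torch.randn(model.cfg.d_model)  # stand-in for the extracted direction
key_layers = [10, 11, 12, 13]                 # hypothetical "key" layers

for i in key_layers:
    block = model.blocks[i]
    block.attn.W_O.data = project_out(block.attn.W_O.data, refusal_dir)
    block.mlp.W_out.data = project_out(block.mlp.W_out.data, refusal_dir)
```

Note this only changes what those blocks write to the residual stream; the embedding can still write along the direction, so it's an approximation rather than a complete removal.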
The direction extracted using the same method will vary per layer, but this doesn't mean that the true feature direction varies that much; rather, it means the direction can't be extracted as a linear function of the activations at layers that are too early or too late.
For this model, we found that activations at the last token position (assuming this prompt template, with two newlines appended to the end) at layer 12 worked well.
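Concretely, the extraction step amounts to a difference of means over the cached activations; a minimal sketch, with random tensors standing in for the layer-12, last-token residual activations on harmful vs. harmless prompts:

```python
import torch

d_model = 4096  # placeholder width
# Stand-ins for cached layer-12, last-token residual-stream activations
# over a harmful and a harmless prompt set (shape: [n_prompts, d_model]):
harmful_acts = torch.randn(128, d_model)
harmless_acts = torch.randn(128, d_model)

# Difference-of-means direction, normalized to a unit vector:
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()
```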
Agreed, it seems less elegant. But someone on Hugging Face did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
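For what it's worth, that kind of layer-by-layer comparison is easy to reproduce; a sketch assuming you've already extracted one direction per layer (random stand-ins here, so the sizes and values are placeholders):

```python
import torch

n_layers, d_model = 32, 4096           # placeholder sizes
dirs = torch.randn(n_layers, d_model)  # stand-in for one direction per layer

# Pairwise cosine similarities between the per-layer directions:
dirs = dirs / dirs.norm(dim=-1, keepdim=True)
similarity = dirs @ dirs.T             # [n_layers, n_layers]
print(similarity)
```

If the direction were truly constant across layers, the off-diagonal entries would stay near 1; the plot linked above suggests they don't.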