Incredible!! I am going to try this myself. I will let you know how it goes.
> honesty vector tuning showed a real advantage over honesty token tuning, comparable to honesty vector steering at the best layer and multiplier:
Is this backwards? I’m having a bit of trouble following your terms. Seems like this post is terribly underrated—maybe others also got confused? Basically, you only need 4 terms, yes?
* base model
* steered model
* activation-tuned model
* token cross-entropy trained model
I think I was reading half the plots backwards or something. Anyway I bet if you reposted with clearer terms/plots then you’d get some good followup work and a lot of general engagement.
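To make the terminology concrete, here is a minimal toy sketch of what "steered model" means relative to "base model": adding a fixed honesty vector, scaled by a multiplier, to one layer's activations at inference time. (Everything here — the toy layer, the variable names, the dimensions — is an illustrative assumption, not code from the post; the tuned variants would instead bake a similar change into the weights via training.)

```python
import numpy as np

# Toy single-hidden-layer setup for illustration only; names and
# dimensions are assumptions, not from the original post.
rng = np.random.default_rng(0)

def base_forward(x, W):
    """'Base model': plain forward pass through one toy layer."""
    return np.tanh(x @ W)

def steered_forward(x, W, honesty_vec, mult):
    """'Steered model': same forward pass, but a fixed honesty vector
    (scaled by a multiplier) is added to the activations at inference."""
    return np.tanh(x @ W) + mult * honesty_vec

d = 8
W = rng.normal(size=(d, d))
honesty_vec = rng.normal(size=d)
x = rng.normal(size=d)

h_base = base_forward(x, W)
h_steered = steered_forward(x, W, honesty_vec, mult=2.0)

# At the chosen layer, steering shifts the activations by exactly
# mult * honesty_vec relative to the base model.
assert np.allclose(h_steered - h_base, 2.0 * honesty_vec)
```

The activation-tuned model, by contrast, would be trained so that its weights reproduce a shift like this without any inference-time intervention, while the token cross-entropy trained model is fine-tuned on output tokens directly.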
Here is my understanding. Is this right?
Thanks! Yes, that’s exactly right. BTW, I’ve since written up this work more formally: https://arxiv.org/pdf/2407.04694 Edit: the correct link is https://arxiv.org/abs/2409.06927
Wrong link? Looks like this is it https://arxiv.org/abs/2409.06927
Copy-pasted from the wrong tab. Thanks!