Incredible!! I am going to try this myself. I will let you know how it goes.
> honesty vector tuning showed a real advantage over honesty token tuning, comparable to honesty vector steering at the best layer and multiplier:
Is this backwards? I’m having a bit of trouble following your terms. Seems like this post is terribly underrated—maybe others also got confused? Basically, you only need 4 terms, yes?
* base model
* steered model
* activation-tuned model
* token cross-entropy trained model
I think I was reading half the plots backwards or something. Anyway I bet if you reposted with clearer terms/plots then you’d get some good followup work and a lot of general engagement.
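To make the terminology concrete, here is a minimal toy sketch of what "steered model" means relative to "base model": adding a fixed honesty vector, scaled by a multiplier, to one layer's activations at inference time. (Everything here — the toy layer, the variable names, the dimensions — is an illustrative assumption, not code from the post; the tuned variants would instead bake a similar change into the weights via training.)

```python
import numpy as np

# Toy single-hidden-layer setup for illustration only; names and
# dimensions are assumptions, not from the original post.
rng = np.random.default_rng(0)

def base_forward(x, W):
    """'Base model': plain forward pass through one toy layer."""
    return np.tanh(x @ W)

def steered_forward(x, W, honesty_vec, mult):
    """'Steered model': same forward pass, but a fixed honesty vector
    (scaled by a multiplier) is added to the activations at inference."""
    return np.tanh(x @ W) + mult * honesty_vec

d = 8
W = rng.normal(size=(d, d))
honesty_vec = rng.normal(size=d)
x = rng.normal(size=d)

h_base = base_forward(x, W)
h_steered = steered_forward(x, W, honesty_vec, mult=2.0)

# At the chosen layer, steering shifts the activations by exactly
# mult * honesty_vec relative to the base model.
assert np.allclose(h_steered - h_base, 2.0 * honesty_vec)
```

The activation-tuned model, by contrast, would be trained so that its weights reproduce a shift like this without any inference-time intervention, while the token cross-entropy trained model is fine-tuned on output tokens directly.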
Here is my understanding. Is this right?
Thanks! Yes, that’s exactly right. BTW, I’ve since written up this work more formally: https://arxiv.org/pdf/2407.04694 Edit: the correct link is https://arxiv.org/abs/2409.06927
Wrong link? Looks like this is it https://arxiv.org/abs/2409.06927
Copy-pasted from the wrong tab. Thanks!