Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I’m surprised by the method of adding the activations that was chosen, because the tokens of the different prompts don’t line up with each other in the way I would have thought was necessary for this approach to work; it’s super interesting to me that it does anyway.
If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:
1. Take multiple pairs of prompts that differ primarily in the property we’re trying to capture.
2. Take the difference in the residual stream at the next token.
3. Take the average difference vector, and add that to every position in the new generated text.
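To be concrete, here is a minimal sketch of those steps. The residual-stream activations are stand-in random arrays; in a real setup they would come from a forward-pass hook at some chosen layer, and `residual_at_next_token` is a hypothetical helper, not part of any actual library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy residual-stream width

def residual_at_next_token(prompt: str) -> np.ndarray:
    """Placeholder: in reality, run the model on `prompt` and read the
    residual stream at the next-token position at a chosen layer."""
    return rng.standard_normal(d_model)

# 1. Pairs of prompts differing primarily in the target property.
pairs = [
    ("The weather today is wonderful", "The weather today is terrible"),
    ("I love this movie", "I hate this movie"),
]

# 2. Difference in the residual stream at the next-token position.
diffs = [residual_at_next_token(a) - residual_at_next_token(b)
         for a, b in pairs]

# 3. Average difference vector ...
steering_vec = np.mean(diffs, axis=0)

# ... added to every position of the activations during new generation
# (here a fake (seq_len, d_model) block of activations; broadcasting
# adds the vector at each position).
seq_len = 5
resid = rng.standard_normal((seq_len, d_model))
steered = resid + steering_vec
```

The key property this relies on is that the difference vectors from different prompt pairs point in roughly the same direction, so averaging reinforces the shared "property" component and washes out prompt-specific noise.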
I’d love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.