Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I’m surprised by the method of adding the activations that was chosen, because the tokens of the different prompts don’t line up with each other in the way I would have thought was necessary for this approach to work. It’s super interesting to me that it does.
If I were to try to reinvent the system after just reading the first paragraph or two, I would have done something like the following (rough code sketch below the list):
1. Take multiple pairs of prompts that differ primarily in the property we’re trying to capture.
2. Take the difference in the residual stream at the next-token position.
3. Take the average difference vector, and add that to every position in the newly generated text.
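For concreteness, here’s a minimal sketch of that variant (not the method from the post): it assumes a HuggingFace GPT-2, and the layer index, scale, function names, and example prompt pairs are all arbitrary choices of mine.

```python
# Sketch of the list above: average the residual-stream difference at the last
# token over contrastive prompt pairs, then add that vector at every position
# while generating. All names and hyperparameters here are my own guesses.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which residual stream to read/steer; arbitrary choice

def resid_at_last_token(prompt: str) -> torch.Tensor:
    """Residual stream at the final token of `prompt`, after block LAYER-1."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, -1, :]  # shape: (d_model,)

def steering_vector_from_pairs(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """Average next-token residual difference over contrastive prompt pairs."""
    diffs = [resid_at_last_token(pos) - resid_at_last_token(neg) for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)

def generate_with_steering(prompt: str, vec: torch.Tensor, scale: float = 4.0) -> str:
    """Add `vec` at every position of the chosen block's output during generation."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec  # broadcasts over all positions
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # hidden_states[LAYER] is the output of block LAYER-1, so hook that block.
    handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                             pad_token_id=tokenizer.eos_token_id)
    finally:
        handle.remove()
    return tokenizer.decode(out[0], skip_special_tokens=True)

pairs = [("I love talking about weddings", "I hate talking about weddings"),
         ("Weddings make me so happy", "Weddings make me so sad")]
vec = steering_vector_from_pairs(pairs)
print(generate_with_steering("I went to the park and", vec))
```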
I’d love to know which parts were chosen from among many alternatives as the ones that worked best, and which were just the first/only things tried.
At first glance I thought this was too abstract to be a useful plan, but coming back to it I think it is promising as a form of automated training for an aligned agent, given an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You would have training loops over synthetic data that can train for all of these forms of consistency, probably implementable as an MVP with current systems.
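To make that concrete, here’s a toy sketch of the loop structure I’m imagining; the statement sampler and the consistency judge are stand-in stubs for real model calls, and none of the names come from an existing system.

```python
# Toy sketch of a consistency-training loop over synthetic data.
# In a real MVP, sample_statements and judge_consistency would be LLM calls;
# here they are stubs so the loop structure is runnable end to end.
import random

def sample_statements(topic: str, n: int = 2) -> list[str]:
    """Stand-in for sampling the agent's statements about a topic (synthetic data)."""
    templates = [f"I value {topic}.", f"I don't care about {topic}."]
    return [random.choice(templates) for _ in range(n)]

def judge_consistency(a: str, b: str) -> float:
    """Stand-in for an agent that is excellent at evaluating small logic chains."""
    return 1.0 if a == b else 0.0

def consistency_training_step(topics: list[str], threshold: float = 0.5):
    """Generate synthetic pairs and score them; in the real loop, the inconsistent
    pairs would become fine-tuning targets (penalize or revise, Constitutional-AI style)."""
    keep, discard = [], []
    for topic in topics:
        a, b = sample_statements(topic)
        score = judge_consistency(a, b)
        (keep if score >= threshold else discard).append((topic, a, b, score))
    return keep, discard

keep, discard = consistency_training_step(["honesty", "privacy", "user autonomy"])
print(f"consistent pairs: {len(keep)}, inconsistent pairs to train against: {len(discard)}")
```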
The main unknown would be how to detect when you are confident enough that its stated values align with human values to start moving down the causal chain toward fitting actions to values, since that step is clearly strongly capabilities-enhancing.
Perhaps you could at least get a measure by looking at comparisons that require multiple steps (human value → stated value → belief, etc.) and asking which link is the bottleneck to reaching the conclusion a human would want. Positing that the agent is capable of this might be assuming away a lot of the problem, though.
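As a sketch of what that measurement could look like (entirely hypothetical, with a stub judge standing in for a real evaluator): score each link of a value → belief → action chain and report the weakest one.

```python
# Hypothetical "which link is the bottleneck?" probe over a multi-step chain.
# judge_link is a toy stub for a model that scores whether the conclusion
# follows from the premise; the chain below is just an illustrative example.
def judge_link(premise: str, conclusion: str) -> float:
    """Stand-in judge: crude word-overlap heuristic instead of a real model call."""
    return 1.0 if premise.split()[-1] in conclusion else 0.2

def bottleneck(chain: list[str]) -> tuple[int, float]:
    """Return the index and score of the weakest link in a reasoning chain."""
    scores = [judge_link(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
    weakest = min(range(len(scores)), key=scores.__getitem__)
    return weakest, scores[weakest]

chain = [
    "Humans value informed consent",                        # human value
    "I should value informed consent",                      # the agent's stated value
    "Users must be told how their data is used",            # belief derived from the value
    "Add a clear disclosure step before collecting data",   # action
]
idx, score = bottleneck(chain)
print(f"weakest link: step {idx} -> {idx + 1} (score {score:.2f})")
```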