Writing down predictions. The main caveat is that these are predictions about how the author will resolve these questions, not my beliefs about how well these techniques will eventually work. I am pretty confident at this stage that value editing can work very well in LLMs once we figure it out, but much less confident that this first attempt will have panned out. (A rough sketch of what one such edit could look like in code follows the prediction list.)
Algebraic value editing works (for at least one “X-vector”) in LMs: 90%
Algebraic value editing works better for larger models, all else equal: 75%
If value edits work well, they are also composable: 80%
If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
“truth-telling”: 10%
“love”: 70%
“accepting death”: 20%
“speaking French”: 80%
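To make concrete what these predictions are about, here is a minimal sketch of one way an “X-vector” edit could be implemented: record residual-stream activations for a contrasting prompt pair, take their difference, scale it, and add it back in during generation. This is my own illustration, not necessarily the author's setup; the model (gpt2), layer index, coefficient, and the “Love”/“Hate” prompt pair are all assumptions chosen for the example.

```python
# Sketch of an "X-vector" activation addition, under the assumption that the
# edit is: difference of residual-stream activations for a contrast pair,
# scaled and added back during generation. Model, layer, coefficient, and
# prompts are illustrative, not the author's exact configuration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # assumed injection layer
COEFF = 4.0  # assumed steering coefficient

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations after transformer block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [i + 1] is the output of block i.
    return out.hidden_states[LAYER + 1]

# Contrast pair defining a hypothetical "love" vector.
pos = block_output("Love")
neg = block_output("Hate")
n = min(pos.shape[1], neg.shape[1])           # align token lengths
steering = COEFF * (pos[:, :n] - neg[:, :n])  # shape: (1, n, d_model)

steered = {"done": False}

def hook(module, inputs, output):
    hidden = output[0]
    # Add the steering vector to the prompt's forward pass only.
    if not steered["done"]:
        k = min(steering.shape[1], hidden.shape[1])
        hidden[:, :k] += steering[:, :k]
        steered["done"] = True
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
try:
    ids = tokenizer("I hate you because", return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Whether edits of this kind compose, transfer to larger models, and avoid degrading capabilities is exactly what the percentages above are betting on.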