Writing down predictions. The main caveat is that these are predictions about how the author will resolve these questions, not my beliefs about how well these techniques will eventually work. I am pretty confident at this stage that value editing can work very well in LLMs once we figure it out, but much less confident that this first attempt will have panned out. (A rough sketch of what one such edit could look like in code follows the prediction list.)
Algebraic value editing works (for at least one “X-vector”) in LMs: 90%
Algebraic value editing works better for larger models, all else equal: 75%
If value edits work well, they are also composable: 80%
If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
“truth-telling”: 10%
“love”: 70%
“accepting death”: 20%
“speaking French”: 80%
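To make concrete what these predictions are about, here is a minimal sketch of one way an “X-vector” edit could be implemented: record residual-stream activations for a contrasting prompt pair, take their difference, scale it, and add it back in during generation. This is my own illustration, not necessarily the author's setup; the model (gpt2), layer index, coefficient, and the “Love”/“Hate” prompt pair are all assumptions chosen for the example.

```python
# Sketch of an "X-vector" activation addition, under the assumption that the
# edit is: difference of residual-stream activations for a contrast pair,
# scaled and added back during generation. Model, layer, coefficient, and
# prompts are illustrative, not the author's exact configuration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # assumed injection layer
COEFF = 4.0  # assumed steering coefficient

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations after transformer block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [i + 1] is the output of block i.
    return out.hidden_states[LAYER + 1]

# Contrast pair defining a hypothetical "love" vector.
pos = block_output("Love")
neg = block_output("Hate")
n = min(pos.shape[1], neg.shape[1])           # align token lengths
steering = COEFF * (pos[:, :n] - neg[:, :n])  # shape: (1, n, d_model)

steered = {"done": False}

def hook(module, inputs, output):
    hidden = output[0]
    # Add the steering vector to the prompt's forward pass only.
    if not steered["done"]:
        k = min(steering.shape[1], hidden.shape[1])
        hidden[:, :k] += steering[:, :k]
        steered["done"] = True
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
try:
    ids = tokenizer("I hate you because", return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Whether edits of this kind compose, transfer to larger models, and avoid degrading capabilities is exactly what the percentages above are betting on.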