In light of Anthropic’s viral “Golden Gate Claude” activation engineering, I want to come back and claim the points I earned here.[1]
I was extremely prescient in predicting the importance and power of activation engineering (then called “AVEC”). In January 2023, right after running the cheese vector as my first idea for interpreting the network, and well before anyone ran LLM steering vectors… I had only seen the cheese-hiding vector work on a few mazes. Given that (seemingly) tiny amount of evidence, I immediately wrote down 60% credence that the technique would be a big deal for LLMs:
The algebraic value-editing conjecture (AVEC). It’s possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as “run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a ‘niceness vector’, and then add the niceness vector to future forward passes.”
Alex is ambivalent about strong versions of AVEC being true. Early on in the project, he booked the following credences (with later updates noted under each one):

1. Algebraic value editing works on Atari agents — 50%
   - 3/4/23: updated down to 30% due to a few other “X vectors” not working for the maze agent.
   - 3/9/23: updated up to 80% based off of additional results not in this post.
2. AVE performs at least as well as the fancier buzzsaw edit from the RL vision paper — 70%
   - 3/4/23: updated down to 40% due to realizing that the buzzsaw moves in the visual field; higher than 30% because we know something like this is possible.
   - 3/9/23: updated up to 60% based off of additional results.
3. AVE can quickly ablate or modify LM values without any gradient updates — 60%
   - 3/4/23: updated down to 35% for the same reason given in (1).
   - 3/9/23: updated up to 65% based off of additional results and learning about related work in this vein.
And even if (3) is true, AVE working well or deeply or reliably is another question entirely. Still...
The cheese vector was easy to find. We immediately tried the dumbest, easiest first approach. We didn’t even train the network ourselves; we just used one of Langosco et al.’s nets (the first and only net we looked at). If this is the amount of work it took to (mostly) stamp out cheese-seeking, then perhaps a simple approach can stamp out e.g. deception in sophisticated models.
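To make the conjecture concrete, here is a minimal sketch of the activation-addition recipe quoted above: run forward passes on a pair of contrasting prompts, take the difference of their residual-stream activations as a steering vector, and add that vector back in during later forward passes. This sketch uses GPT-2 small through Hugging Face transformers; the layer index, steering coefficient, and prompt pair are illustrative assumptions, not the settings from the papers cited below.

```python
# Minimal sketch of activation addition ("steering vectors"), under the
# assumptions stated above. Layer, coefficient, and prompts are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # which transformer block's residual stream to steer (assumption)
COEFF = 4.0  # steering coefficient (assumption)

def residual_stream(prompt: str) -> torch.Tensor:
    """Residual-stream activations just after block LAYER for `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0]  # (seq_len, d_model)

# "Niceness vector": contrast two prompts and average over token positions.
nice = residual_stream("You are kind and helpful.")
rude = residual_stream("You are rude and hostile.")
steering = nice.mean(0) - rude.mean(0)

def hook(module, inputs, output):
    # Add the steering vector to every position's residual stream at this block.
    hidden = output[0] + COEFF * steering
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I think that you", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

With a positive coefficient this nudges generations toward the “nice” direction; subtracting the vector instead pushes the other way, roughly analogous to how the cheese vector was subtracted from the maze agent's activations.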
[1]: I generally think this work (https://arxiv.org/abs/2310.08043) and the GPT-2 steering work (https://arxiv.org/abs/2308.10248) are under-cited and under-credited in the blossoming field of activation engineering, and I want to call that out. Please cite this work when appropriate:

@article{turner2023activation,
  title={Activation addition: Steering language models without optimization},
  author={Turner, Alex and Thiergart, Lisa and Udell, David and Leech, Gavin and Mini, Ulisse and MacDiarmid, Monte},
  journal={arXiv preprint arXiv:2308.10248},
  year={2023}
}

@article{mini2023understanding,
  title={Understanding and Controlling a Maze-Solving Policy Network},
  author={Mini, Ulisse and Grietzer, Peli and Sharma, Mrinank and Meek, Austin and MacDiarmid, Monte and Turner, Alexander Matt},
  journal={arXiv preprint arXiv:2310.08043},
  year={2023}
}