One perspective is that representation engineering allows us to do “single-bit edits” to the network’s behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly less; LoRA even less; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.
(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
This is beneficial from a safety point of view, as instead of saying “we trained the model, it learnt some unknown circuit that fits the data” we can say “no new circuits were learnt, we just flipped a flag and this fits the data”.
In the world where this works, and works well enough to replace RLHF (or some other major part of the training process), we should end up with more controlled network edits.
One perspective is that representation engineering allows us to do “single-bit edits” to the network’s behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly less; LoRA even less; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.
(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
This is beneficial from a safety point of view, as instead of saying “we trained the model, it learnt some unknown circuit that fits the data” we can say “no new circuits were learnt, we just flipped a flag and this fits the data”.
In the world where this works, and works well enough to replace RLHF (or some other major part of the training process), we should end up with more controlled network edits.