Automatically achieving a fixed impact level for steering vectors. It’s kinda annoying doing hyperparameter search over validation performance (e.g. on TruthfulQA) to figure out the best coefficient for a steering vector. If you want to achieve a fixed intervention strength, I think it’d be good to instead choose the coefficient by line search (over real-valued coefficients) so that it hits a target average log-prob shift on the multiple-choice train set (e.g. adding the vector gives a 3-bit boost to the average log-prob of the correct TruthfulQA answers on the training set).
Just a few forward passes!
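Here’s a minimal sketch of what that line search could look like, written as plain bisection. Everything named here is an assumption for illustration: `find_coefficient` and `toy_shift_bits` are hypothetical names, the toy "model" is a made-up two-choice setup standing in for real forward passes, and bisection only works if the average shift increases with the coefficient over the bracket, which a real model does not guarantee.

```python
import math
from typing import Callable


def find_coefficient(
    shift_bits: Callable[[float], float],
    target_bits: float,
    hi: float = 1.0,
    tol: float = 1e-3,
    max_iters: int = 50,
) -> float:
    """Bisect for a coefficient c with shift_bits(c) close to target_bits.

    Assumption: shift_bits is continuous and increasing for c >= 0.
    Each call to shift_bits stands for one pass over the train set.
    """
    lo = 0.0
    # Grow the upper end of the bracket until the target is exceeded.
    for _ in range(max_iters):
        if shift_bits(hi) >= target_bits:
            break
        hi *= 2.0
    else:
        raise ValueError("target shift not reached; check the monotonicity assumption")
    # Standard bisection on the bracket [lo, hi].
    for _ in range(max_iters):
        mid = 0.5 * (lo + hi)
        value = shift_bits(mid)
        if abs(value - target_bits) <= tol:
            return mid
        if value < target_bits:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), in nats."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


# Toy stand-in for a model (hypothetical numbers): each train question has two
# answers, the unsteered logit gap (correct minus incorrect) is GAPS[i], and
# steering with coefficient c adds c * EFFECTS[i] to that gap.
GAPS = [-8.0, -6.0, -7.0]
EFFECTS = [1.0, 0.8, 1.2]


def toy_shift_bits(c: float) -> float:
    """Average boost, in bits, to the log-prob of the correct answer."""
    shifts = [
        (log_sigmoid(g + c * e) - log_sigmoid(g)) / math.log(2)
        for g, e in zip(GAPS, EFFECTS)
    ]
    return sum(shifts) / len(shifts)


coeff = find_coefficient(toy_shift_bits, target_bits=3.0)
print(f"coefficient: {coeff:.3f}, shift: {toy_shift_bits(coeff):.3f} bits")
```

With a real model, `toy_shift_bits` would be swapped for a function that adds the scaled steering vector, runs the train set, and returns the average log-prob shift in bits.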
This might also remove the need to sweep coefficients for each vector you compute: k-bit boosts on the steering vector’s train set might automatically control for that!
Thanks to Mark Kurzeja for the line search suggestion (instead of SGD on the coefficient).