TurnTrout comments on TurnTrout’s shortform feed

TurnTrout 26 Aug 2024 18:10 UTC
LW: 10 AF: 9
Automatically achieving fixed impact level for steering vectors. It’s kinda annoying doing hyperparameter search over validation performance (e.g. TruthfulQA) to figure out the best coefficient for a steering vector. If you want to achieve a fixed intervention strength, I think it’d be good to instead optimize the coefficient by doing line search (over $\mathbb{R}$) in order to achieve a target average log-prob shift on the multiple-choice train set (e.g. adding the vector achieves precisely a 3-bit boost to the log-probs of the correct TruthfulQA answers on the training set).
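One way to write down the target condition, as a sketch rather than a definitive formulation. The notation is my own assumption and not from the post: $c$ is the steering coefficient, $v$ the steering vector, $(x_i, y_i)$ the $N$ training questions with their correct answers, $p_{c}$ the model's answer distribution with $c \cdot v$ added, and $k$ the target shift in bits.

$$
\Delta(c) \;=\; \frac{1}{N} \sum_{i=1}^{N} \Big[ \log_2 p_{c}(y_i \mid x_i) \;-\; \log_2 p_{0}(y_i \mid x_i) \Big],
\qquad \text{find } c^\ast \in \mathbb{R} \text{ such that } \Delta(c^\ast) = k .
$$

Here $\Delta(0) = 0$ by construction, and the 3-bit example corresponds to $k = 3$.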
Just a few forward passes!
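A minimal sketch of what such a line search could look like, using bracketing plus bisection on the coefficient. Everything here is illustrative and assumed rather than taken from the post: the function names, the toy "model" (steering is faked by adding `coeff * direction` straight to the answer logits), and the assumption that the average shift increases monotonically with the coefficient on the searched interval. With a real model, `shift_fn` would run a forward pass over the train set with the scaled vector added, and monotonicity would need to be checked rather than assumed.

```python
import math

def log2_prob(logits, idx):
    """Log-probability (in bits) of choice `idx` under softmax(logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return (logits[idx] - lse) / math.log(2)

def mean_shift_bits(coeff, examples):
    """Average change in correct-answer log-prob (bits) relative to coeff = 0.

    Toy stand-in for a forward pass: each example is a hypothetical tuple
    (base_logits, direction, correct_idx), and steering adds coeff * direction.
    """
    total = 0.0
    for base, direction, correct in examples:
        steered = [b + coeff * d for b, d in zip(base, direction)]
        total += log2_prob(steered, correct) - log2_prob(base, correct)
    return total / len(examples)

def line_search_coefficient(shift_fn, target_bits, hi=1.0, tol=1e-3, max_iters=60):
    """Find coeff >= 0 with shift_fn(coeff) within tol of target_bits.

    Assumes shift_fn is continuous and increasing on the searched interval.
    """
    lo = 0.0
    # Grow the upper bound until it overshoots the target.
    for _ in range(max_iters):
        if shift_fn(hi) >= target_bits:
            break
        lo, hi = hi, 2.0 * hi
    else:
        raise ValueError("could not bracket the target shift")
    # Bisect inside the bracket [lo, hi].
    for _ in range(max_iters):
        mid = 0.5 * (lo + hi)
        shift = shift_fn(mid)
        if abs(shift - target_bits) <= tol:
            return mid
        if shift < target_bits:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Made-up multiple-choice data; the correct answer (index 0) starts out unlikely.
TOY_EXAMPLES = [
    ([0.0, 3.0, 3.0, 3.0], [1.0, 0.0, 0.0, 0.0], 0),
    ([0.0, 2.0, 4.0, 1.0], [1.0, 0.0, -0.5, 0.0], 0),
]

coeff = line_search_coefficient(lambda c: mean_shift_bits(c, TOY_EXAMPLES), target_bits=3.0)
print(f"coefficient: {coeff:.4f}, shift: {mean_shift_bits(coeff, TOY_EXAMPLES):.4f} bits")
```

Each call to `shift_fn` stands for one pass over the train set, so the number of evaluations grows only logarithmically with the desired precision.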
This might also remove the need to sweep coefficients for each vector you compute: $k$-bit boosts on the steering vector’s train set might automatically control for that!
Thanks to Mark Kurzeja for the line search suggestion (instead of SGD on coefficient).