neverix comments on SAE features for refusal and sycophancy steering vectors

neverix 16 Oct 2024 17:48 UTC
2 points
1. It doesn’t really make sense to interpret feature activation values as log probabilities. If we did, we’d have to worry about how to scale them. It’s also not guaranteed that the score wouldn’t just decrease because of lower accuracy on correct answers.
2. Phi seems specialized for MMLU-like problems and has an outsized score for a model of its size, so I would be surprised if it’s biased by the format of the question. However, it’s possible that using answers instead of letters would improve raw accuracy in this case: the feature we used (45142) seems to max-activate on plain text rather than on multiple-choice answers, and it’s somewhat surprising that it does this well on multiple-choice questions. For this reason, I think using text answers will boost the feature’s score but won’t change the relative ranking of features by accuracy. I don’t know how to adapt your proposed experiment to the task of finding how accurate a feature is at eliciting the model’s knowledge of correct answers.
3. This is a technical detail. We use jax.lax.scan over the layer index and store all layers in one structure with a special named axis. This mainly improves compilation time, since the layer body is traced and compiled once instead of once per layer. Penzai has since implemented the technique in the main library.
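
To illustrate point 3, here is a minimal sketch of scanning over layers in plain JAX. It is not our actual code and does not use Penzai's named axes: the toy layer (`layer_fn`), the parameter layout, and the helper names are all hypothetical, and the stacked axis is just the leading array axis. The only idea it shares with the real setup is that every layer's weights live in one structure and `jax.lax.scan` iterates over the layer index.

```python
import jax
import jax.numpy as jnp


def layer_fn(x, params):
    # Hypothetical toy layer (a residual tanh block), standing in for a
    # real transformer layer.
    return x + jnp.tanh(x @ params["w"] + params["b"])


def init_stacked_params(key, n_layers, d_model):
    # All layers are stored in one pytree; the leading axis of every leaf
    # is the layer index.
    w = jax.random.normal(key, (n_layers, d_model, d_model)) * 0.02
    b = jnp.zeros((n_layers, d_model))
    return {"w": w, "b": b}


@jax.jit
def forward_scan(stacked, x):
    # scan slices each leaf along its leading axis, so the layer body is
    # traced once no matter how many layers there are.
    def step(carry, layer_params):
        return layer_fn(carry, layer_params), None

    out, _ = jax.lax.scan(step, x, stacked)
    return out


def forward_loop(stacked, x):
    # Reference implementation: an unrolled Python loop over layers.
    n_layers = stacked["w"].shape[0]
    for i in range(n_layers):
        layer_params = jax.tree_util.tree_map(lambda p: p[i], stacked)
        x = layer_fn(x, layer_params)
    return x


key = jax.random.PRNGKey(0)
params = init_stacked_params(key, n_layers=4, d_model=8)
x = jnp.ones((2, 8))
y_scan = forward_scan(params, x)
y_loop = forward_loop(params, x)
```

Both functions compute the same thing. The difference is that the loop version, if jitted, unrolls into one copy of the layer per iteration, while the scan version hands the compiler a single layer body.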