Nina Rimsky
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn’t find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.
I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.
FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.
I think this post is novel compared to both my work and RepE because they:Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position at which to extract the representation
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
Test on many different models
Describe a way of turning this into a weight-edit
Edit:
(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
I am contrasting generating an output by:
Modeling how a human would respond (“human modeling in output generation”)
Modeling what the ground-truth answer is
Eg. for common misconceptions, maybe most humans would hold a certain misconception (like that South America is west of Florida), but we want the LLM to realize that we want it to actually say how things are (given it likely does represent this fact somewhere)
I expect if you average over more contrast pairs, like in CAA (https://arxiv.org/abs/2312.06681), more of the spurious features in steering vectors are cancelled out leading to higher quality vectors and greater sparsity in the dictionary feature domain. Did you find this?
This is really cool work!!
In other experiments we’ve run (not presented here), the MSP is not well-represented in the final layer but is instead spread out amongst earlier layers. We think this occurs because in general there are groups of belief states that are degenerate in the sense that they have the same next-token distribution. In that case, the formalism presented in this post says that even though the distinction between those states must be represented in the transformers internal, the transformer is able to lose those distinctions for the purpose of predicting the next token (in the local sense), which occurs most directly right before the unembedding.
Would be interested to see analyses where you show how an MSP is spread out amongst earlier layers.
Presumably, if the model does not discard intermediate results, something like concatenating residual stream vectors from different layers and then linearly correlating with the ground truth belief-state-over-HMM-states vector extracts the same kind of structure you see when looking at the final layer. Maybe even with the same model you analyze, the structure will be crisper if you project the full concatenated-over-layers resid stream, if there is noise in the final layer and the same features are represented more cleanly in earlier layers?
In cases where redundant information is discarded at some point, this is a harder problem of course.
Profound!
It’s possible that once my iron reserves were replenished through supplementation, the amount of iron needed to maintain adequate levels was lower, allowing me to maintain my iron status through diet alone. Iron is stored in the body in various forms, primarily in ferritin, and when levels are low, the body draws upon these reserves.
I’ll never know for sure, but the initial depletion of my iron reserves could have been due to a chest infection I had around that time (as infections can lead to decreased iron absorption and increased iron loss) or a period of unusually poor diet.
Once my iron reserves were replenished, my regular diet seemed to be sufficient to prevent a recurrence of iron deficiency, as the daily iron requirement for maintenance is lower than the amount needed to correct a deficiency.
The wrapper modules simply wrap existing submodules of the model, and call whatever they are wrapping (in this case
self.attn
) with the same arguments, and then save some state / do some manipulation of the output. It’s just the syntax I chose to use to be able to both save state from submodules, and manipulate the values of some intermediate state. If you want to see exactly how that submodule is being called, you can look at the llama huggingface source code. In the code you gave, I am adding some vector to thehidden_states
returned by that attention submodule.
We used the same steering vectors, derived from the non fine-tuned model
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger by using “honesty” in the prompt by default all the time. Instead we could vary this instruction for the comparison to prompting case, and have it empty otherwise. Something I would change in future replications I do.
Yes, this is almost correct. The test task had the A/B question followed by
My answer is (
after the end instruction token, and the steering vector was added to every token position after the end instruction token, so to all ofMy answer is (
.
Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy and were instead written to just display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results here would be more interpretable if the prompts were more carefully chosen, I should re-run this with better prompts.
Steering Llama-2 with contrastive activation additions
Is my understanding correct or am I missing something?
A latent variable is a variable you can sample that gives you some subset of the mutual information between all the different X’s + possibly some independent extra “noise” unrelated to the X’s
A natural latent is a variable that you can sample that at the limit of sampling will give you all the mutual information between the X’s—nothing more or less
E.g. in the biased die example, every each die roll sample has, in expectation, the same information content, which is the die bias + random noise, and so the mutual info of n rolls is the die bias itself
(where “different X’s” above can be thought of as different (or the same) distributions over the observation space of X corresponding to the different sampling instances, perhaps a non-standard framing)
suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse.
Only tangentially related, but your intuition about polynomial regression is not quite correct. A large range of polynomial regression learning tasks will display double descent where adding more and more higher degree polynomials consistently improves loss past the interpolation threshold.
Examples from here:Relevant paper
A framing for interpretability
It does seem like small initialisation is a regularisation of a sort, but it seems pretty hard to imagine how it might first allow a memorising solution to be fully learned, and then a generalising solution.
“Memorization” is more parallelizable and incrementally learnable than learning generalizing solutions and can occur in an orthogonal subspace of the parameter space to the generalizing solution.
And so one handwavy model I have of this is a low parameter norm initializes the model closer to the generalizing solution than otherwise, and so a higher proportion of the full parameter space is used for generalizing solutions.
The actual training dynamics here would be the model first memorizes a high proportion of the training data while simultaneously learning a lossy/inaccurate version of the generalizing solution in another subspace (the “prioritization” / “how many dimensions are being used” extent of the memorization being affected by the initialization norm). Then, later in training, the generalization can “win out” (due to greater stability / higher performance / other regularization).
Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!
The direction extracted using the same method will vary per layer, but this doesn’t mean that the correct feature direction varies that much, but rather that it cannot be extracted using a linear function of the activations at too early/late layers.