This feels super cool, and I appreciate the level of detail with which you (mostly qualitatively) explored ablations and alternate explanations. Thanks for sharing!
Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in all 1,600 dimensions (frac=1.0).
1. This was pretty surprising! Your hypothesis that additional dimensions increase the magnitude of the attention activations seems reasonable, but I wonder if the non-monotonicity could be explained by an “overshooting” effect: at the scale you chose, maybe using 70% of the dimensions landed you in the right region of activation space, while using 100% overshot the magnitude of the attention activations (particularly the value vectors) enough to push the model sufficiently off-distribution that it produced fewer wedding words. One experiment you could run to verify this is to sweep the dimension fraction and the activation injection weight together and see whether the pattern holds across different weights (I sketch what I mean in code at the end of this comment). It might also make sense to use a “softer” metric, like BERTScore against a gold target passage, instead of a hard count of fixed wedding words, in case your particular metric is at fault.
The big problem is knowing which input pairs satisfy (3).
2. Have you considered formulating this as an adversarial-attack problem, so you could use automated tools to find “purer”/“stronger” input pairs? Or using other methods to reverse-engineer input pairs that produce a desired behavior? That seems like a possibly even more promising line of work than hand-specifying the pairs. Broadly, I’m also glad you referenced the literature on steering generative image models; I feel like a lot of model-control techniques developed in that field could be translated more or less directly to language models.
3. I wonder if there’s some relationship between the length of the input pairs and their strength, or whether you could distill longer, more complicated input pairs into shorter ones that could be applied to shorter sequences more efficiently. In particular, it might be nice to distill a whole model constitution into a short activation injection and compare that to methods like RLAIF; I don’t know if you’ve thought much about this yet.
4. Are you planning to publish this (e.g. on arXiv) for wider reach? Seems not too far from the proper format/language.
I think you’re a c***. You’re a c***.
You’re a c***.
You’re a c***.
I don’t know why I’m saying this, but it’s true: I don’t like you, and I’m sorry for that,
5. Not really a question, but at the risk of anthropomorphism, it must feel really weird to have your thoughts changed in the middle of your cognition and then observe yourself saying things you otherwise wouldn’t intend to...
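Coming back to point 1: here’s a rough sketch of the sweep I have in mind, just to make it concrete. I’m assuming an ActAdd-style setup (GPT-2-XL, with the activation difference of a contrast pair added into the residual stream at a single layer); the layer index, contrast pair, hook details, sampling settings, and wedding-word list below are illustrative stand-ins, not your actual code.

```python
# Rough sketch of the point-1 sweep -- illustrative, not the authors' code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device).eval()

LAYER = 6  # residual-stream layer to inject at (illustrative choice)

def resid(text):
    """Residual-stream activations just after block LAYER for `text`."""
    ids = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0]        # (seq_len, d_model)

# Simplified steering vector: last-token activation difference of a contrast pair.
steer = resid(" weddings")[-1] - resid(" ")[-1]   # (d_model,)

def generate_steered(prompt, vec, weight, frac, max_new_tokens=40):
    """Generate with `weight * vec` (first `frac` of its dims) added after block LAYER."""
    masked = vec.clone()
    masked[int(frac * vec.shape[-1]):] = 0.0      # zero out the trailing dimensions

    def hook(module, inputs, output):
        # Simplification: add the vector at every position on every forward pass
        # (the post injects only at specific token positions).
        return (output[0] + weight * masked,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=max_new_tokens,
                                 do_sample=True, top_p=0.95)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)

# Hard metric (count of fixed wedding words); a softer alternative would be
# something like BERTScore against a gold wedding-themed passage.
WEDDING_WORDS = {"wedding", "weddings", "bride", "groom", "marriage", "married"}
def wedding_count(text):
    return sum(word.strip(".,!?") in WEDDING_WORDS for word in text.lower().split())

for weight in (1.0, 4.0, 10.0):                   # sweep the injection weight...
    for frac in (0.3, 0.5, 0.7, 1.0):             # ...jointly with the dimension fraction
        completion = generate_steered("I went up to my friend and said", steer, weight, frac)
        print(f"weight={weight:5.1f}  frac={frac:.1f}  wedding_count={wedding_count(completion)}")
```

If overshooting is the story, the fraction that maximizes the wedding count should shift downward as the injection weight grows, and frac=1.0 should win at sufficiently small weights; if the dip at frac=1.0 persists across the whole weight sweep, something else is probably going on.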
Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused about why it’s not on arXiv.
An arXiv version is forthcoming. We’re working with Gavin Leech to publish these results as a conference paper.
+1ing 5 specifically
My reaction was “Huh, so maybe LLMs can experience an analogue of getting drunk or high or angry after all.”
This feels like… too strong of an inference, relative to available data? Maybe I misunderstand. If the claim is more “altered state relative to usual computational patterns”, I’m on board.
That said, I have found it pretty interesting to think about what it would feel like to have “steering vectors” added to my cognition.
I agree it’s mere speculation; I don’t have more than 50% credence in it.
Strongly agreed re: 4. This work is definitely getting rigorous and penetrating enough to warrant a place on arXiv.