Why do you guys think this is happening? One possibility, it seems to me, is that the model is doing some amount of ensembling (thinking back to The Clock and the Pizza, where ensembling happened in a toy setting). W.r.t. “across all steering vectors”, that’s pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.
Also, what are y’all’s intuitions on picking layers for this stuff? I understand from the post that you control early layers because we suppose they might act something like switches between different classes of functionality. However, the choice of layer 10 seems to imply that you don’t want to go too early, maybe because the very early layers are still unembedding and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid search to get a notion of what the effects are elsewhere? If the latter, it would be nice to know the results, to aid in hypothesis formation and the like.
This is really cool! Exciting to see that it’s possible to explore the space of possible steering vectors without having to know what to look for a priori. I’m new to this field, so I have a few questions. I’m not sure if they’ve been answered elsewhere:
Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
It looks like the hypersphere constraint is there so that the optimizer doesn’t select a vector that lands far away simply because its norm is large. Is there any other reason to use this sort of constraint?
How do people usually enforce things like norm or orthogonality constraints as hard constraints? I assume not with regular loss-based regularization, since that’s not hard. I assume iterative “optimize and project” is not always optimal, but maybe it usually is (it seems to be what is going on here, but I’m not sure). Do Lagrange multipliers work? It seems like they should, but I’ve never used them for ML. I’m guessing that in the bigger picture this doesn’t matter.
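To make the question concrete, here is a minimal sketch of what I mean by “optimize and project”: take an ordinary gradient step, then rescale the vector back onto the sphere. The quadratic objective is just a toy stand-in of mine, not the objective from the post, and the function names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def project_to_sphere(v, radius):
    # Rescale v so that its norm is exactly `radius` (assumes v is nonzero).
    n = norm(v)
    return [radius * x / n for x in v]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def projected_gradient_ascent(A, v0, radius=1.0, lr=0.1, steps=200):
    """Maximize the toy objective 0.5 * v^T A v subject to ||v|| = radius."""
    v = project_to_sphere(v0, radius)
    for _ in range(steps):
        grad = matvec(A, v)                        # gradient for symmetric A
        v = [x + lr * g for x, g in zip(v, grad)]  # unconstrained step
        v = project_to_sphere(v, radius)           # snap back onto the sphere
    return v

# Toy problem: the maximizer on the sphere is the top eigenvector direction.
A = [[2.0, 0.0], [0.0, 1.0]]
v = projected_gradient_ascent(A, [1.0, 1.0], radius=3.0)
print(v, norm(v))
```

In this toy case the iterate converges to the top eigenvector of `A` scaled to the chosen radius, so projection does find the constrained optimum here. Whether that carries over to a non-convex objective like the one in the post is exactly what I’m unsure about.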
Have you experimented with adapter rank, and/or is there knowledge on what ranks tend to work where? I’m curious about the degree of sparsity. You also mention doing LoRA for attention instead, and I’m curious if you’ve tried it yet.
W.r.t. the “spiky” parametrization options, have you tried just optimizing over certain subspaces? I guess the motivation of the spikiness must be that we would like to maintain as much as possible of the “general processing” going on but I wonder if having a large power can axe the gradient for R < 1.
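A small sketch of the gradient worry, under my own reading: I’m taking `r` to be whatever scalar quantity is below 1 and `p` to be the power, which may not match the exact parametrization in the post. The derivative of `r**p` with respect to `r` is `p * r**(p - 1)`, which shrinks quickly as `p` grows when `r < 1`.

```python
def power_grad(r, p):
    """Derivative of r**p with respect to r."""
    return p * r ** (p - 1)

# Gradient magnitude at r = 0.5 for increasingly "spiky" powers.
for p in (1, 2, 8, 32):
    print(p, power_grad(0.5, p))
```

At `r = 0.5` the gradient is 1.0 for `p = 2`, 0.0625 for `p = 8`, and on the order of 1e-8 for `p = 32`, which is the kind of vanishing I have in mind.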
Is there a way to propagate this backwards to the prompts that you are exploring? Some people in the comments bring up the question of how natural these directions might be.
Not sure to what extent we understand how RLHF, supervised finetuning and other finetuning methods currently work. What are your intuitions? If we are able to simply add some sort of vector in an early layer it would seem to support the mental model that finetuning mainly switches which behavior gets preferentially used instead of radically altering what is present in the model.
Thanks!