4gate comments on Mechanistically Eliciting Latent Behaviors in Language Models

4gate 6 May 2024 20:33 UTC
3 points
This is really cool! Exciting to see that it’s possible to explore the space of possible steering vectors without having to know what to look for a priori. I’m new to this field so I had a few questions. I’m not sure if they’ve been answered elsewhere.
1. Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
2. It looks like the hypersphere constraint is so that the optimizer doesn’t select something far away due to being large. Is there any reason to use this sort of constraint other than that?
3. How do people usually impose things like a norm or orthogonality constraint as a hard constraint? I assume not regular loss-based regularization, since that only enforces the constraint softly. I assume iterative “optimize and project” is not always optimal but maybe it’s usually optimal (it seems to be what is going on here but not sure?). Do Lagrange multipliers work? It seems like they should but I’ve never used them for ML. I’m guessing that in the bigger picture this doesn’t matter.
4. Have you experimented with adaptor rank and/or is there knowledge on what ranks tend to work where? I’m curious about the degree of sparsity. You also mention doing LoRA for attention instead and I’m curious if you’ve tried it yet.
5. W.r.t. the “spiky” parametrization options, have you tried just optimizing over certain subspaces? I guess the motivation of the spikiness must be that we would like to maintain as much as possible of the “general processing” going on but I wonder if having a large power can axe the gradient for R < 1.
6. Is there a way to propagate this backwards to prompts that you are exploring? Some people do bring up the question in the comments about how natural these directions might be.
7. Not sure to what extent we understand how RLHF, supervised finetuning and other finetuning methods currently work. What are your intuitions? If we are able to simply add some sort of vector in an early layer it would seem to support the mental model that finetuning mainly switches which behavior gets preferentially used instead of radically altering what is present in the model.
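To make question 3 concrete, here is a minimal sketch of what I mean by “optimize and project” for a norm constraint. It is a toy under my own assumptions, not the method from the post: the quadratic objective, the matrix `A`, the radius, and the helper names `project_to_sphere` and `optimize_and_project` are all made up for illustration.

```python
import numpy as np

def project_to_sphere(v, radius):
    """Rescale v so its L2 norm equals radius (the hard constraint)."""
    return v * (radius / np.linalg.norm(v))

def optimize_and_project(grad_fn, v0, radius, lr=0.1, steps=500):
    """Gradient ascent, projecting back onto the sphere after every step."""
    v = project_to_sphere(v0, radius)
    for _ in range(steps):
        v = v + lr * grad_fn(v)           # unconstrained ascent step
        v = project_to_sphere(v, radius)  # re-impose ||v|| = radius
    return v

# Toy objective (an assumption standing in for the real one): f(v) = v^T A v.
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
A = M @ M.T                      # symmetric positive semidefinite
grad_fn = lambda v: 2.0 * A @ v  # gradient of v^T A v

radius = 4.0
v = optimize_and_project(grad_fn, rng.normal(size=8), radius)
```

For this toy objective the loop reduces to power iteration, so `v` ends up along the top eigenvector of `A` with norm exactly `radius`. The constraint holds after every step by construction, which is the property a penalty term in the loss would not give you.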
Thanks!