Here are a couple of experiments which could go towards making the link between activation engineering and interpolating between different simulacra: check LLFC (i.e. whether adding/interpolating the activations of the different models works) on the RLHF fine-tuned models from Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards; alternatively, do this for the supervised fine-tuned models from section 3.3 of Exploring the Benefits of Training Expert Language Models over Instruction Tuning, where they show LMC (linear mode connectivity) for supervised fine-tuning of LLMs.
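To make the first experiment concrete, here is a minimal sketch of what checking LLFC between two such fine-tuned models could look like: interpolate the weights, then test whether the interpolated model’s layerwise features match the interpolation of the endpoint models’ features. The model names, prompt, interpolation coefficient, and cosine-similarity metric are all illustrative placeholders, not the setup from either cited paper.

```python
# Minimal LLFC check between two fine-tuned models that share a pretrained base.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "gpt2"                                # hypothetical shared pretrained base
ft_a_id, ft_b_id = "ft-model-A", "ft-model-B"   # hypothetical fine-tuned checkpoints
alpha = 0.5

tok = AutoTokenizer.from_pretrained(base_id)
model_a = AutoModelForCausalLM.from_pretrained(ft_a_id)
model_b = AutoModelForCausalLM.from_pretrained(ft_b_id)

# Weight-space interpolation: theta_alpha = alpha * theta_A + (1 - alpha) * theta_B
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
merged_sd = {
    k: (alpha * sd_a[k] + (1 - alpha) * sd_b[k])
    if sd_a[k].is_floating_point() else sd_a[k]
    for k in sd_a
}
merged = AutoModelForCausalLM.from_pretrained(ft_a_id)
merged.load_state_dict(merged_sd)

def hidden_states(model, text):
    """One tensor of hidden states per layer for a single prompt."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states

text = "Example prompt for probing layerwise features."
h_a = hidden_states(model_a, text)
h_b = hidden_states(model_b, text)
h_m = hidden_states(merged, text)

# LLFC (approximately) holds if, layer by layer, the merged model's features
# match the linear interpolation of the endpoint models' features.
for layer, (a, b, m) in enumerate(zip(h_a, h_b, h_m)):
    interp = alpha * a + (1 - alpha) * b
    cos = torch.nn.functional.cosine_similarity(interp.flatten(), m.flatten(), dim=0)
    print(f"layer {layer}: cos(interpolated features, merged features) = {cos.item():.3f}")
```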
I still don’t quite see the connection—if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?
Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?
Yes, roughly (the next comment is supposed to make the connection clearer, though also more speculative); RLHF / supervised fine-tuned models would correspond to ‘more mode-collapsed’ / narrower mixtures of simulacra here (in the limit of mode collapse, one fine-tuned model = one simulacrum).
Even more speculatively, in-context learning (ICL) as Bayesian model averaging (especially section 4.1) and ICL as gradient descent fine-tuning with weight-activation duality (see e.g. the first figures from https://arxiv.org/pdf/2212.10559.pdf and https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent) could be other ways to try to link activation engineering / Inference-Time Intervention and task arithmetic. Though also see skepticism about the claims of the above ICL-as-gradient-descent papers, including e.g. that the results mostly seem to apply to single-layer linear attention (and, relatedly, that activation engineering doesn’t seem to work equally well across all layers / attention heads).
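For reference, the weight-activation duality those ICL-as-gradient-descent papers lean on is easiest to state for single-layer linear attention (which is also where the skepticism bites); a simplified sketch of the identity, with notation loosened relative to the papers:

```latex
% Single-layer linear attention over in-context demonstrations x'_i,
% acting on a query q = W_Q x (notation simplified relative to the papers):
\[
\mathrm{LinAttn}(q)
  \;=\; \sum_i (W_V x'_i)\,(W_K x'_i)^{\top} q
  \;=\; \underbrace{\Big(\sum_i (W_V x'_i)(W_K x'_i)^{\top}\Big)}_{\Delta W_{\mathrm{ICL}}}\, q .
\]
% The demonstrations act as an implicit weight update \Delta W_{\mathrm{ICL}}
% applied to the zero-shot computation -- the same outer-product form as a
% one-step gradient-descent update \Delta W_{\mathrm{GD}} = \sum_i e_i x_i^{\top}
% on a linear layer, so steering activations and editing weights coincide here.
```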
I think, given all the recent results on in-context learning, task/function vectors, and activation engineering and their compositionality (https://arxiv.org/abs/2310.15916, https://arxiv.org/abs/2311.06668, https://arxiv.org/abs/2310.15213), these links between task arithmetic, in-context learning, and activation engineering are confirmed to a large degree. This might also suggest trying to import improvements to task arithmetic (e.g. https://arxiv.org/abs/2305.12827, or, more broadly, from the citations of the task arithmetic paper) into activation engineering.
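As a toy version of what importing these results into activation engineering looks like: much of the function-vector / steering-vector work boils down to caching an activation difference and adding it back into the residual stream at some layer. Below is a minimal sketch using forward hooks, assuming a GPT-2-style block layout; the layer index, scale, and contrastive prompts are placeholders rather than values from the linked papers.

```python
# Minimal steering-vector sketch: cache mean residual-stream activations for two
# contrastive prompt sets, take the difference, and add it back in at generation time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
layer_idx = 6      # illustrative layer choice
scale = 4.0        # illustrative steering strength

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
block = model.transformer.h[layer_idx]  # GPT-2-style block; path differs by architecture

def mean_activation(prompts):
    """Mean residual-stream output of `block` at the last token of each prompt."""
    acts, cache = [], {}
    def hook(_module, _inp, out):
        cache["h"] = out[0]  # GPT-2 blocks return a tuple; [0] is the hidden states
    handle = block.register_forward_hook(hook)
    for p in prompts:
        with torch.no_grad():
            model(**tok(p, return_tensors="pt"))
        acts.append(cache["h"][0, -1])
    handle.remove()
    return torch.stack(acts).mean(dim=0)

# "Task vector" in activation space: difference of means between contrastive prompt sets.
steer = mean_activation(["I love this.", "What a great day."]) - \
        mean_activation(["I hate this.", "What a terrible day."])

def steering_hook(_module, _inp, out):
    # Add the cached vector to the residual stream at every position.
    return (out[0] + scale * steer,) + out[1:]

handle = block.register_forward_hook(steering_hook)
ids = model.generate(**tok("The movie was", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(ids[0]))
```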
Great comments! Actually, we have made some progress in linking task arithmetic with our NeurIPS 2023 results, and we are working on a new manuscript to present them. We hope the new paper can be released soon.
Awesome, excited to see that work come out!
Quoting from @zhanpeng_zhou’s latest work—Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm: ‘i) Model averaging takes the average of weights of multiple models, which are finetuned on the same dataset but with different hyperparameter configurations, so as to improve accuracy and robustness. We explain the averaging of weights as the averaging of features at each layer, building a stronger connection between model averaging and logits ensemble than before. ii) Task arithmetic merges the weights of models, that are finetuned on different tasks, via simple arithmetic operations, shaping the behaviour of the resulting model accordingly. We translate the arithmetic operation in the parameter space into the operations in the feature space, yielding a feature-learning explanation for task arithmetic. Furthermore, we delve deeper into the root cause of CTL and underscore the impact of pretraining. We empirically show that the common knowledge acquired from the pretraining stage contributes to the satisfaction of CTL. We also take a primary attempt to prove CTL and find that the emergence of CTL is associated with the flatness of the network landscape and the distance between the weights of two finetuned models. In summary, our work reveals a linear connection between finetuned models, offering significant insights into model merging/editing techniques. This, in turn, advances our understanding of underlying mechanisms of pretraining and finetuning from a feature-centric perspective.’
Speculatively, it might also be fruitful to go about this the other way round, e.g. to try to come up with better weight-space task erasure methods by analogy with concept erasure methods (which operate in activation space), via the task arithmetic / activation engineering link (a rough weight-space sketch is below).
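For concreteness, the crudest weight-space analogue of erasure in the task arithmetic framing is negating a task vector (forgetting via negation); a minimal sketch, with the model names and scaling coefficient as placeholders, onto which concept-erasure-style constraints (e.g. preserving directions unrelated to the task) could then be grafted:

```python
# Minimal task-negation sketch (weight-space "erasure" via task arithmetic):
# tau_task = theta_ft - theta_base; subtracting lam * tau_task from the base
# weights is the weight-space analogue of removing that behaviour.
import torch
from transformers import AutoModelForCausalLM

base_id = "gpt2"            # hypothetical pretrained base
ft_id = "ft-model-on-task"  # hypothetical checkpoint fine-tuned on the task to erase
lam = 1.0                   # scaling coefficient, typically tuned on held-out data

base = AutoModelForCausalLM.from_pretrained(base_id)
ft = AutoModelForCausalLM.from_pretrained(ft_id)

base_sd, ft_sd = base.state_dict(), ft.state_dict()
edited_sd = {
    k: (base_sd[k] - lam * (ft_sd[k] - base_sd[k]))   # theta_base - lam * tau_task
    if base_sd[k].is_floating_point() else base_sd[k]
    for k in base_sd
}
edited = AutoModelForCausalLM.from_pretrained(base_id)
edited.load_state_dict(edited_sd)
# `edited` should now do worse on the erased task while (ideally) retaining the
# base model's general behaviour; concept-erasure-style projections could be
# imported analogously to constrain which weight directions get removed.
```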