The part where you can average weights is unique to diffusion models, as far as I can tell. That makes sense: the 2D structure of images is very local, which establishes a strong preferred basis for the representations of different networks.
Exchanging knowledge between two language models currently seems approximately impossible? Like, you can train on the outputs, but I don’t think there is really any way for two language models to learn from each other by exchanging any kind of cognitive content, or to improve the internal representations of a language model by giving it access to the internal representations of another language model.
There’s a pretty rich literature on this stuff, transferring representational/functional content between neural networks. Averaging weights to transfer knowledge is not unique to diffusion models: it works on image models trained with non-diffusion setups (https://arxiv.org/abs/2203.05482, https://arxiv.org/abs/2304.03094) as well as on non-image tasks such as language modeling (https://arxiv.org/abs/2208.03306, https://arxiv.org/abs/2212.04089). Exchanging knowledge between language models via weight averaging is possible provided that the models share a common initialization + early training trajectory. And if you allow for more methods than weight averaging, simple techniques like Knowledge Distillation or stitching via cross-attention (https://arxiv.org/abs/2106.13884) are known to work for transferring knowledge.
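(For concreteness, here is a minimal sketch of what checkpoint averaging and task-vector-style transfer look like in practice, assuming PyTorch-style state dicts from models fine-tuned off the same base. The function names, paths, and coefficients are illustrative, not taken from the linked papers.)

```python
import torch

def average_weights(state_dict_a, state_dict_b, alpha=0.5):
    """Linearly interpolate two checkpoints with the same architecture
    (and, importantly, a shared initialization / early training trajectory)."""
    assert state_dict_a.keys() == state_dict_b.keys()
    return {
        name: alpha * state_dict_a[name] + (1 - alpha) * state_dict_b[name]
        for name in state_dict_a
    }

def apply_task_vector(base, finetuned, target, scale=1.0):
    """Task-arithmetic-style transfer: add the (finetuned - base) delta
    from one model onto another model derived from the same base."""
    return {
        name: target[name] + scale * (finetuned[name] - base[name])
        for name in base
    }

# Illustrative usage (paths are placeholders):
# merged = average_weights(torch.load("model_a.pt"), torch.load("model_b.pt"))
```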
I think requiring a “common initialization + early training trajectory” is a pretty huge obstacle to knowledge sharing, and would de facto make knowledge sharing among the vast majority of large language models infeasible.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works. I don’t know much about Knowledge Distillation, so maybe that is actually something that would fit the “knowledge sharing is easy” description. (My models here aren’t very confident, and I don’t have super strong predictions on whether knowledge sharing among LLMs is possible or impossible; my sense is just that so far we haven’t succeeded at doing it without very large costs, which is why, as far as I can tell, new large language models are basically always trained from scratch after architectural changes.)
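(For reference, a minimal sketch of the standard Knowledge Distillation objective: train the student to match the teacher’s softened output distribution on shared inputs. Names and the temperature are illustrative, not any particular paper’s recipe.)

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KD objective: push the student's (softened) output
    distribution toward the teacher's on the same inputs."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as in the original KD formulation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```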
I think requiring a “common initialization + early training trajectory” is a pretty huge obstacle to knowledge sharing, and would de facto make knowledge sharing among the vast majority of large language models infeasible.
Agreed. That part of my comment was aimed only at the claim about weight averaging only working for diffusion/image models, not about knowledge sharing more generally.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works.
Not sure I see any particular argument against the scalability of knowledge exchange between LLMs in general or via cross-attention, though, especially if we’re comparing the cost of transfer to the cost of re-running the original training. That’s why people are exploring this, especially smaller/independent researchers. There are a bunch of concurrent recent efforts to take frozen unimodal models and stitch them into multimodal ones (example from a few days ago: https://arxiv.org/abs/2305.17216). Heck, the dominant approach in the community of LLM hobbyists seems to be transferring behaviors and knowledge from GPT-4 into LLaMa variants via targeted synthetic data generation. What kind of scalability are you thinking of?
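(For concreteness, a minimal sketch of what cross-attention stitching tends to look like: both pretrained models stay frozen, and only a small inserted cross-attention block is trained so that one model’s hidden states can condition on the other’s. Module names and dimensions are illustrative, loosely in the spirit of the Flamingo-style setups referenced above.)

```python
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Small trainable adapter that lets a frozen 'reader' model attend to
    the hidden states of a frozen 'donor' model; only this block is trained."""
    def __init__(self, reader_dim, donor_dim, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(donor_dim, reader_dim)   # map donor states into the reader's space
        self.attn = nn.MultiheadAttention(reader_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # tanh(0) = 0, so the bridge starts as a no-op

    def forward(self, reader_hidden, donor_hidden):
        # reader_hidden: (batch, reader_len, reader_dim); donor_hidden: (batch, donor_len, donor_dim)
        donor = self.proj(donor_hidden)
        attended, _ = self.attn(query=reader_hidden, key=donor, value=donor)
        # Gated residual: the reader's original computation is untouched at initialization.
        return reader_hidden + torch.tanh(self.gate) * attended
```

(Because only the bridge parameters are trained, the cost is a small fraction of re-running either model’s original training, which is the comparison being made above.)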