I think requiring a “common initialization + early training trajectory” is a pretty huge obstacle to knowledge sharing, and would de-facto make knowledge sharing among the vast majority of large language models infeasible.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works. I don’t know much about knowledge distillation, so maybe that is actually something that would fit the “knowledge sharing is easy” description. (My models here aren’t very confident, and I don’t have super strong predictions on whether knowledge sharing among LLMs is possible or impossible; my sense is just that so far we haven’t succeeded at doing it without very large costs, which is why, as far as I can tell, new large language models are basically always trained from scratch whenever we make architectural changes.)
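If I understand the basic distillation recipe correctly, it is roughly: train a student to match a teacher’s softened output distribution on shared data, which needs a shared output vocabulary but no common initialization or architecture. A minimal sketch of that loss (PyTorch-style; the function name and toy tensors are just for illustration):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation as in Hinton et al. (2015): push the student's
    output distribution toward the teacher's, softened by a temperature.
    Only a shared output space (e.g. a vocabulary) is required, not a shared
    architecture or initialization."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage: random logits standing in for two unrelated models over a shared vocab.
vocab_size = 32000
teacher_logits = torch.randn(4, vocab_size)                       # frozen teacher
student_logits = torch.randn(4, vocab_size, requires_grad=True)   # trainable student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

The remaining cost, as far as I can tell, is running both models forward over a large shared dataset, which is the kind of expense I had in mind above.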
I think requiring a “common initialization + early training trajectory” is a pretty huge obstacle to knowledge sharing, and would de-facto make knowledge sharing among the vast majority of large language models infeasible.
Agreed. That part of my comment was aimed only at the claim that weight averaging works only for diffusion/image models, not at knowledge sharing more generally.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works.
Not sure I see any particular argument against the scalability of knowledge exchange between LLMs in general, or via cross-attention specifically, though, especially if we’re comparing the cost of transfer to the cost of re-running the original training. That cost gap is why people are exploring this, especially smaller/independent researchers. There are a bunch of concurrent recent efforts to take frozen unimodal models and stitch them into multimodal ones (example from a few days ago: https://arxiv.org/abs/2305.17216). Heck, the dominant approach in the community of LLM hobbyists seems to be transferring behaviors and knowledge from GPT-4 into LLaMA variants via targeted synthetic data generation. What kind of scalability are you thinking of?
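For concreteness, the stitching approach is roughly: keep both pretrained models frozen and train only a small cross-attention block that lets one model read the other’s hidden states, so the transfer cost is a small fraction of either model’s original training. A minimal sketch (names and dimensions are made up; the actual papers differ in the details):

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Trainable bridge between two frozen models: the 'reader' model's hidden
    states attend over the 'donor' model's hidden states. Only this adapter
    is trained; both base models stay frozen."""
    def __init__(self, reader_dim, donor_dim, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(donor_dim, reader_dim)  # map donor features into the reader's space
        self.attn = nn.MultiheadAttention(reader_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(reader_dim)
        self.gate = nn.Parameter(torch.zeros(1))      # zero-init gate: starts as a no-op

    def forward(self, reader_hidden, donor_hidden):
        donor = self.proj(donor_hidden)
        attended, _ = self.attn(query=reader_hidden, key=donor, value=donor)
        # Gated residual, so the reader's original computation is untouched at init.
        return self.norm(reader_hidden + torch.tanh(self.gate) * attended)

# Toy usage with stand-in activations for the two frozen models.
reader_hidden = torch.randn(2, 16, 1024)  # e.g. hidden states from a frozen LLM layer
donor_hidden = torch.randn(2, 49, 768)    # e.g. patch features from a frozen vision encoder
adapter = CrossAttentionAdapter(reader_dim=1024, donor_dim=768)
fused = adapter(reader_hidden, donor_hidden)
print(fused.shape)  # torch.Size([2, 16, 1024])
```

You drop a few of these between the reader’s layers and train just the adapters on a comparatively small paired dataset, which is why it looks cheap relative to retraining either model from scratch.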