Moreover, given neural networks, trading cognitive content has turned out to be not particularly hard. It does not require superintelligence to share representations between different neural networks; a language model can be adapted to handle visual data without enormous difficulty. Encodings from BERT or an ImageNet model can be applied to a variety of downstream tasks, and this is by now a standard element in toolkits and workflows. When you share architectures and training data, as with two differently fine-tuned diffusion models, you can get semantically meaningful merges between networks simply by averaging their weights. Thoughts are not remotely “written in a different language.”
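The weight-averaging trick mentioned above is only a few lines. Here is a minimal sketch using NumPy arrays as stand-ins for the state dicts of two fine-tunes of the same base model; all parameter names and values are hypothetical:

```python
import numpy as np

def average_weights(state_a, state_b, alpha=0.5):
    """Interpolate two checkpoints that share an architecture.

    state_a / state_b: dicts mapping parameter names to arrays,
    standing in for the state dicts of two fine-tunes of one base model.
    """
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {name: alpha * state_a[name] + (1 - alpha) * state_b[name]
            for name in state_a}

# Two toy "fine-tuned" checkpoints of the same two-parameter model.
ckpt_style_a = {"w": np.array([1.0, 3.0]), "b": np.array([0.0])}
ckpt_style_b = {"w": np.array([3.0, 1.0]), "b": np.array([2.0])}

merged = average_weights(ckpt_style_a, ckpt_style_b)
print(merged["w"])  # [2. 2.]
```

Real merges do exactly this over every tensor in the two state dicts; the interesting empirical question, discussed below, is when the result is semantically meaningful rather than noise.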
Huh, I am very surprised by this section. When I read the description I thought you would obviously call this prediction the other way around.
The part where you can average weights is unique to diffusion models, as far as I can tell, which makes sense because the 2-d structure of the images is very local, and so this establishes a strong preferred basis for the representations of different networks.
Exchanging knowledge between two language models currently seems approximately impossible? Like, you can train on the outputs, but I don’t think there is really any way for two language models to learn from each other by exchanging any kind of cognitive content, or to improve the internal representations of a language model by giving it access to the internal representations of another language model.
I am not confident there won’t be any breakthrough here, but my sense is that indeed current architectures feel somewhat uniquely bad for sharing cognitive content. And it seems very weird to call this prediction the other way.
I would also call this one for Eliezer. I think we mostly just retrain AI systems without reusing anything. I think that’s what you’d guess on Eliezer’s model, and very surprising on Robin’s model. The extent to which we throw things away is surprising even to a very simple common-sense observer.
I would have called “Human content is unimportant” for Robin—it seems like the existing ML systems that are driving current excitement (and are closest to being useful) lean extremely heavily on imitation of human experts and mostly don’t make new knowledge themselves. So far game-playing AI has been an exception rather than the rule (and this special case was already mostly established by the time of the debate).
That said, I think it would be reasonable to postpone judgment on most of these questions since we’re not yet in the end of days (Robin thinks it’s still fairly far, and Eliezer thinks it’s close but things will change a lot by the intelligence explosion). The main ones I’d be prepared to call unambiguously already are:
Short AI timelines and very general AI architectures: obvious advantage to Eliezer.
Importance of compute, massive capital investment, and large projects selling their output to the world: obvious advantage to Robin.
These aren’t literally settled, but market odds have moved really far since the debate, and they both seem like defining features of the current world. In each case I’d say that one of the two participants was clearly super wrong and the other was basically right.
If someone succeeds in getting, say, a ~13B-parameter model to equal the performance (at high-level tasks) of a previous-gen model 10x that size, using a 10x smaller FLOPs budget during training, isn’t that a pretty big win for Eliezer? That seems to be roughly what is happening: this list mostly has larger models at the top, but not uniformly so.
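For scale, the common rule of thumb that training cost is about 6 FLOPs per parameter per token makes it easy to check what such a claim implies. The model sizes and token counts below are hypothetical, chosen only to match the 10x figures in the comment:

```python
def train_flops(n_params, n_tokens):
    """Rough training-cost estimate: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# Hypothetical numbers for illustration only.
big   = train_flops(130e9, 300e9)  # previous-gen 130B-parameter model
small = train_flops(13e9,  300e9)  # 13B-parameter model, same token count

ratio = big / small
print(round(ratio, 6))  # 10.0
```

So if the smaller model really matches the bigger one at a tenth of the training FLOPs, the gap must have been closed by algorithmic or data improvements rather than raw compute.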
I’d say it was more like this: there was a large minimum amount of compute needed to make things work at all, but most of the innovation in LLMs came from the algorithmic improvements that were also needed to make them work at all.
Hobbyists and startups can train their own models from scratch without massive capital investment, though not the absolute largest ones, and not completely for free. This capability does require massive capital expenditures by hardware manufacturers to improve the underlying compute technology sufficiently, but massive capital investments in silicon manufacturing technology are nothing new, even if they have been accelerated and directed a bit by AI in the last 15 years.
And I don’t think it would have been surprising to Eliezer (or anyone else in 2008) that if you dump more compute at some problems, you get gradually increasing performance. For example, in 2008, you could have made massive capital investments to build the largest supercomputer in the world, and gotten the best chess engine by enabling the SoTA algorithms to search 1 or 2 levels deeper in the Chess game tree. Or you could have used that money to pay for researchers to continue looking for algorithmic improvements and optimizations.
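The exponential cost of buying search depth with compute is easy to see: each extra ply multiplies the number of nodes searched by roughly the effective branching factor. The factor of ~6 below is a rough figure for chess with alpha-beta pruning, used only for illustration:

```python
def nodes_searched(branching_factor, depth):
    # Game-tree search grows exponentially with depth.
    return branching_factor ** depth

b = 6  # rough effective branching factor for chess with alpha-beta pruning
extra_cost = nodes_searched(b, 12) / nodes_searched(b, 10)
print(extra_cost)  # 36.0 -- two extra plies cost ~b^2 times more compute
```

That is why "largest supercomputer in the world" buys only one or two plies: each additional level of depth costs another multiplicative factor of b.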
Coming in late, but the surprising thing on Yudkowsky’s model is that compute turned out to be way more important than he realized: even on the models most favorable to Yudkowsky, the split is usually about 50/50, which means compute increases are not negligible and algorithms aren’t totally dominant.
Even granting the assumption that algorithms will increasingly be the bottleneck and compute will matter less, Yudkowsky way overrated the power of algorithms and thinking hard compared to just getting more resources and scaling up.
There’s a pretty rich literature on this stuff, transferring representational/functional content between neural networks.
Averaging weights to transfer knowledge is not unique to diffusion models. It works on image models trained with non-diffusion setups (https://arxiv.org/abs/2203.05482, https://arxiv.org/abs/2304.03094) as well as on non-image tasks such as language modeling (https://arxiv.org/abs/2208.03306, https://arxiv.org/abs/2212.04089). Exchanging knowledge between language models via weight averaging is possible provided that the models share a common initialization + early training trajectory. And if you allow for more methods than weight averaging, simple stuff like Knowledge Distillation or stitching via cross-attention (https://arxiv.org/abs/2106.13884) are tricks known to work for transferring knowledge.
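The task-vector idea cited above (https://arxiv.org/abs/2212.04089) can be illustrated in miniature: treat the weight delta between a fine-tune and its shared base as a "task vector", and add several such vectors back onto the base. The toy two-parameter "models" below are hypothetical stand-ins for full state dicts, and the trick presupposes exactly the common initialization being discussed here:

```python
import numpy as np

# A shared base model and two fine-tunes of it (toy two-parameter models).
base      = {"w": np.array([0.0, 0.0])}
ft_task_a = {"w": np.array([1.0, 0.0])}  # fine-tuned on task A
ft_task_b = {"w": np.array([0.0, 2.0])}  # fine-tuned on task B

def task_vector(finetuned, base):
    """A task vector is the weight delta relative to the shared base."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_vectors(base, *vectors):
    """Add task vectors onto the base to compose the learned behaviors."""
    out = {k: base[k].copy() for k in base}
    for vec in vectors:
        for k in out:
            out[k] += vec[k]
    return out

multi = apply_vectors(base,
                      task_vector(ft_task_a, base),
                      task_vector(ft_task_b, base))
print(multi["w"])  # [1. 2.]
```

The composed model picks up both deltas at once; whether that composition is meaningful in practice is precisely what depends on the shared base.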
I think requiring a “common initialization + early training trajectory” is a pretty huge obstacle to knowledge sharing, and would de-facto make knowledge sharing among the vast majority of large language models infeasible.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works. I don’t know much about Knowledge Distillation, so maybe that is actually something that would fit the “knowledge sharing is easy” description. (My models here aren’t very confident, and I don’t have super strong predictions on whether knowledge sharing among LLMs is possible or impossible; my sense was just that so far we haven’t succeeded at doing it without very large costs, which is why, as far as I can tell, new large language models are basically always trained from scratch after we make architectural changes.)
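For what it’s worth, the core of classic Knowledge Distillation is just training the student against the teacher’s softened output distribution. A minimal sketch of the loss, with hypothetical logits and pure-Python softmax:

```python
import math

def softmax(logits, temp=1.0):
    exps = [math.exp(x / temp) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temp=2.0):
    """Cross-entropy of the student against the teacher's softened
    (temperature > 1) distribution -- the core distillation objective."""
    p_teacher = softmax(teacher_logits, temp)
    p_student = softmax(student_logits, temp)
    return -sum(pt * math.log(ps)
                for pt, ps in zip(p_teacher, p_student))

teacher  = [2.0, 0.5, -1.0]
aligned  = kd_loss([2.0, 0.5, -1.0], teacher)  # student matches teacher
diverged = kd_loss([-1.0, 0.5, 2.0], teacher)  # student disagrees
print(aligned < diverged)  # True
```

Minimizing this loss over lots of inputs pulls the student’s full output distribution toward the teacher’s, which is why it transfers more than hard-label training on the teacher’s outputs would.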
Agreed. That part of my comment was aimed only at the claim about weight averaging only working for diffusion/image models, not about knowledge sharing more generally.
Not sure I see any particular argument against the scalability of knowledge exchange between LLMs in general, or via cross-attention in particular, though, especially if we’re comparing the cost of transfer to the cost of re-running the original training. That cost advantage is why people are exploring this, especially smaller/independent researchers. There are a bunch of concurrent recent efforts to take frozen unimodal models and stitch them into multimodal ones (example from a few days ago: https://arxiv.org/abs/2305.17216). Heck, the dominant approach in the community of LLM hobbyists seems to be transferring behaviors and knowledge from GPT-4 into LLaMa variants via targeted synthetic data generation. What kind of scalability are you thinking of?
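The synthetic-data route mentioned above is structurally very simple: sample prompts, let the stronger model answer them, and fine-tune the weaker model on the resulting pairs. A schematic sketch, where the teacher call is a stub standing in for an API call to a stronger model, and the prompts are made up:

```python
def teacher_answer(prompt):
    # Stub standing in for a call to a stronger model (e.g. via an API).
    return f"Answer to: {prompt}"

def build_synthetic_dataset(prompts):
    """Pair each prompt with the teacher's answer to form training data."""
    return [(p, teacher_answer(p)) for p in prompts]

prompts = ["What is 2+2?", "Name a prime number."]
dataset = build_synthetic_dataset(prompts)
print(len(dataset))  # 2
# `dataset` would then be fed into ordinary supervised
# fine-tuning of the student model.
```

Note that this transfers behavior through text alone, with no access to the teacher’s weights or internal representations, which is part of why it scales so easily for hobbyists.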
In addition to what cfoster0 said, I’m kinda excited about the next ~2-3 years of cross-LLM knowledge transfer, so this seems like a differing prediction about the future, which is fun.
My model for why it hasn’t happened already is in part just that most models know the same stuff, because they’re trained on extremely similar enormous swathes of text, so there’s no gain to be had by sticking them together. That would be why more effort goes into LLM / images / video glue than LLM / LLM glue.
But abstractly, a world where LLMs can meaningfully be connected to vision models but not to other LLMs would be surprising to me. I expect something like training one model on code and another on non-code text, and then sticking them together, to be possible.