It does not require superintelligence to share representations between different neural networks
I don’t think you can train one transformer on a dataset that mentions fact Y but not fact X, train a second transformer on a dataset that mentions fact X but not fact Y, and then easily share the knowledge of X and Y between them.
Let’s say we have a language model that only knows how to speak English and a second one that only knows how to speak Japanese. Is your expectation that there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the “glue” takes <1% of the compute used to train the independent models?
I weakly expect the opposite, largely based on stuff like this, and on playing around with algebraic value editing to get an LLM to output French in response to English (though note that the LLM I did that with knew English and the general shape of what French looks like, so there’s no guarantee that result scales or would transfer the way I’m imagining).
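For readers unfamiliar with algebraic value editing / activation additions, here is a rough sketch of the core trick, with a toy module and made-up activations standing in for a real LLM’s residual stream; the layer choice and the steering coefficient are illustrative assumptions, not the settings used in the experiment mentioned above.

```python
# Rough sketch of activation addition ("algebraic value editing"): compute a
# steering vector from two contrasting prompts and add it to a layer's input at
# inference time. Toy stand-ins throughout; real experiments hook an actual LLM.
import torch
import torch.nn as nn

D_MODEL = 768
toy_block = nn.Linear(D_MODEL, D_MODEL)  # stand-in for one transformer block

# Pretend these are the block's activations on a French prompt and an English one.
french_acts = torch.randn(1, D_MODEL)
english_acts = torch.randn(1, D_MODEL)
steering_vector = french_acts - english_acts  # "French minus English" direction

def add_steering(module: nn.Module, inputs: tuple) -> tuple:
    """Forward pre-hook: shift the block's input along the steering direction."""
    (x,) = inputs
    return (x + 4.0 * steering_vector,)  # coefficient is hand-tuned in practice

handle = toy_block.register_forward_pre_hook(add_steering)
steered = toy_block(torch.randn(1, D_MODEL))  # activations nudged toward "French"
handle.remove()
```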
there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the “glue” takes <1% of the comput[ing] used to train the independent models?
Correct. They’re two entirely different models. There’s no way they could interoperate without massive computing and building a new model.
(Aside: was that a typo, or did you intend to say “compute” instead of “computing power”?)
There’s no way they could interoperate without massive computing and building a new model.
It has historically been shown that a vision model and a language model can be made to interoperate[1]. And, more recently, it has been shown that yes, you can use a fancy transformer to map between the intermediate representations of your image and text models, but you don’t have to do that: it works fine[2] to just use your frozen image encoder, then a linear mapping (!), then your text decoder.
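To make the recipe from [2] concrete, here is a minimal PyTorch sketch, with toy stand-in modules and made-up dimensions in place of the real pretrained models: both pretrained pieces stay frozen, and the only thing trained is a single linear projection between them.

```python
# Minimal sketch of "frozen image encoder -> linear mapping -> frozen text decoder".
# Everything except `glue` is a frozen stand-in for a pretrained model.
import torch
import torch.nn as nn

D_IMG, D_TXT, VOCAB = 512, 768, 32000  # hypothetical dimensions

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, D_IMG))  # stand-in for a frozen vision model
text_decoder = nn.Sequential(nn.Linear(D_TXT, D_TXT), nn.ReLU(), nn.Linear(D_TXT, VOCAB))  # stand-in for a frozen LM
for module in (image_encoder, text_decoder):
    for p in module.parameters():
        p.requires_grad_(False)  # the pretrained pieces are never updated

glue = nn.Linear(D_IMG, D_TXT)  # the only trainable piece: one linear map
optimizer = torch.optim.Adam(glue.parameters(), lr=1e-4)

def training_step(images: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    """One step of training the linear glue on (image, caption-token) pairs."""
    with torch.no_grad():
        img_feats = image_encoder(images)   # frozen forward pass
    logits = text_decoder(glue(img_feats))  # only `glue` receives gradients
    loss = nn.functional.cross_entropy(logits, target_tokens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Shape check with random data:
training_step(torch.randn(8, 3, 224, 224), torch.randint(0, VOCAB, (8,)))
```

The glue has roughly D_IMG × D_TXT parameters, a tiny fraction of either pretrained model, which is what makes the “<1% of the compute” framing plausible.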
I personally expect a similar phenomenon if you use the first half of an English-only pretrained language model and the second half of a Japanese-only pretrained language model: you might not literally be able to use a linear mapping as above, but I expect you could use a quite cheap mapping. That said, I am not aware of anyone who has actually attempted the thing, so I could be wrong that the result from [2] will generalize that far.
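A hedged sketch of what that untried experiment might look like, again with toy stand-ins rather than real pretrained halves; the “cheap mapping” here is assumed to be a small two-layer adapter, though it could plausibly be as simple as a single linear layer.

```python
# Sketch of stitching the early layers of one pretrained LM to the late layers of
# another, training only a small adapter in between. Toy stand-ins throughout.
import torch
import torch.nn as nn

D_EN, D_JA, N_LAYERS = 768, 1024, 6  # hypothetical widths and depths

def toy_lm_half(width: int, n_layers: int) -> nn.Module:
    """Stand-in for a stack of transformer layers (half of a pretrained LM)."""
    return nn.Sequential(*[
        nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True)
        for _ in range(n_layers)
    ])

english_first_half = toy_lm_half(D_EN, N_LAYERS)    # frozen, "English-only" half
japanese_second_half = toy_lm_half(D_JA, N_LAYERS)  # frozen, "Japanese-only" half
for module in (english_first_half, japanese_second_half):
    for p in module.parameters():
        p.requires_grad_(False)

# The cheap glue: a tiny fraction of either model's parameter count.
adapter = nn.Sequential(nn.Linear(D_EN, D_JA), nn.GELU(), nn.Linear(D_JA, D_JA))

def stitched_forward(english_embeddings: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        h = english_first_half(english_embeddings)  # English-side features
    h = adapter(h)                                  # trained mapping between halves
    return japanese_second_half(h)                  # Japanese-side features

# Shape check: batch of 2 sequences of length 16.
print(stitched_forward(torch.randn(2, 16, D_EN)).shape)  # torch.Size([2, 16, 1024])
```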
(Aside: was that a typo, or did you intend to say “compute” instead of “computing power”?)
Yeah, I did mean “computing power” there. I think it’s just a weird way that people in my industry use words.[3]
Example: DeepMind’s Flamingo, which demonstrated that it was possible at all to take a pretrained language model and a pretrained vision model and glue them together into a multimodal model, and that doing so produced SOTA results on a number of benchmarks. See also this paper, also out of DeepMind.
Per Linearly Mapping from Image to Text Space
For example, see this HN discussion about it. See also the “compute” section of this post, which talks about things that are “compute-bound” rather than “bounded by the amount of available computing power”.
Why waste time use lot word when few word do trick?
(“There’s no way” is too strong a claim. My expectation is that there’s a way to train something from scratch, using <1% of the compute that was used to train either LLM, that works better.)
But I was talking about sharing the internal representations between the two already trained transformers.