there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the “glue” takes <1% of the comput[ing] used to train the independent models?
Correct. They’re two entirely different models. There’s no way they could interoperate without massive computing and building a new model.
(Aside: was that a typo, or did you intend to say “compute” instead of “computing power”?)
There’s no way they could interoperate without massive computing and building a new model.
It has historically been shown that one can glue a vision model and a language model together[1]. And, more recently, it has been shown that while you can use a fancy transformer to map between the intermediate representations of your image and text models, you don’t have to: it works fine[2] to just use your frozen image encoder, then a linear mapping (!), then your text decoder.
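To make the [2] recipe concrete, here is a minimal PyTorch sketch of the frozen-encoder / linear-map / frozen-decoder setup. The class and argument names are my own placeholders rather than the paper’s actual implementation, and the decoder is assumed to accept a sequence of embeddings directly; the point is just that the only trainable piece is a single img_dim × txt_dim matrix.

```python
import torch
import torch.nn as nn

class LinearBridge(nn.Module):
    """Frozen image encoder -> trainable linear map -> frozen text decoder.

    `image_encoder` and `text_decoder` are placeholders for any pretrained
    modules; only `self.proj` has trainable parameters.
    """

    def __init__(self, image_encoder, text_decoder, img_dim, txt_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_decoder = text_decoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_decoder.parameters():
            p.requires_grad = False
        # The entire "glue": one linear projection from the image encoder's
        # feature space into the language model's embedding space.
        self.proj = nn.Linear(img_dim, txt_dim)

    def forward(self, images, caption_embeds):
        # images -> (batch, n_patches, img_dim), computed with gradients off
        with torch.no_grad():
            img_feats = self.image_encoder(images)
        prefix = self.proj(img_feats)  # (batch, n_patches, txt_dim)
        # Prepend the projected image features as "soft tokens" ahead of the
        # caption embeddings and let the frozen decoder model the caption.
        return self.text_decoder(torch.cat([prefix, caption_embeds], dim=1))
```

Training this amounts to fitting one matrix on captioning data, which is why the cost is negligible next to pretraining either model.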
I personally expect a similar phenomenon if you use the first half of an English-only pretrained language model and the second half of a Japanese-only pretrained language model: you might not literally be able to get away with a linear mapping as above, but I expect a quite cheap mapping would work. That said, I am not aware of anyone who has actually attempted this, so I could be wrong that the result from [2] will generalize that far.
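For the bilingual version, the “quite cheap mapping” might look something like the sketch below. This is purely hypothetical (as noted, nobody has run the experiment as far as I know); english_lm and japanese_lm are placeholder objects assumed to expose their transformer blocks as `.blocks`, and the embedding/unembedding layers are omitted for brevity.

```python
import torch.nn as nn

def glue_lms(english_lm, japanese_lm, d_en, d_ja):
    """Hypothetical stitch: the bottom half of an English-only LM feeding the
    top half of a Japanese-only LM through a small trainable adapter.

    Everything pretrained is frozen; only `adapter` would be trained, e.g. on
    a modest parallel corpus -- that is the "<1% of the compute" scenario.
    """
    bottom = list(english_lm.blocks)[: len(english_lm.blocks) // 2]
    top = list(japanese_lm.blocks)[len(japanese_lm.blocks) // 2:]

    # Freeze everything that came from the pretrained models.
    for block in bottom + top:
        for p in block.parameters():
            p.requires_grad = False

    # The "quite cheap mapping": maybe a bare nn.Linear(d_en, d_ja) suffices,
    # maybe it needs a small MLP like this one -- that is the open question.
    adapter = nn.Sequential(nn.Linear(d_en, d_ja), nn.GELU(), nn.Linear(d_ja, d_ja))

    class GluedLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.bottom = nn.ModuleList(bottom)
            self.adapter = adapter
            self.top = nn.ModuleList(top)

        def forward(self, h):               # h: (batch, seq, d_en) hidden states
            for block in self.bottom:       # English "reader" half
                h = block(h)
            h = self.adapter(h)             # map into the Japanese model's space
            for block in self.top:          # Japanese "writer" half
                h = block(h)
            return h                        # (batch, seq, d_ja)

    return GluedLM()
```

Whether a frozen stitch like this actually translates well is exactly the untested empirical question; the sketch is only meant to show how little new machinery the claim requires.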
(Aside: was that a typo, or did you intend to say “compute” instead of “computing power”?)
Yeah, I did mean “compute” there. I think it’s just a weird way that people in my industry use words.[3]
[1] Example: DeepMind’s Flamingo, which demonstrated that it was possible at all to take a pretrained language model and a pretrained vision model and glue them together into a multimodal model, and that doing so produced SOTA results on a number of benchmarks. See also this paper, also out of DeepMind.
[2] Per Linearly Mapping from Image to Text Space.
[3] For example, see this HN discussion about it. See also the “compute” section of this post, which talks about things that are “compute-bound” rather than “bounded on the amount of available computing power”.
Why waste time use lot word when few word do trick?