there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the “glue” takes <1% of the comput[ing] used to train the independent models?
Correct. They’re two entirely different models. There’s no way they could interoperate without massive computing and building a new model.
(Aside: was that a typo, or did you intend to say “compute” instead of “computing power”?)
There’s no way they could interoperate without massive computing and building a new model.
It has historically been shown that one can glue a vision model and a language model together[1]. And, more recently, it has been shown that while you can use a fancy transformer to map between the intermediate representations of your image and text models, you don’t have to: it works fine[2] to just use your frozen image encoder, then a linear mapping (!), then your text decoder.
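To make the [2] recipe concrete, here is a minimal PyTorch sketch of the frozen-encoder / linear-map / frozen-decoder setup. The class and argument names are my own placeholders rather than the paper’s actual implementation, and the decoder is assumed to accept a sequence of embeddings directly; the point is just that the only trainable piece is a single img_dim × txt_dim matrix.

```python
import torch
import torch.nn as nn

class LinearBridge(nn.Module):
    """Frozen image encoder -> trainable linear map -> frozen text decoder.

    `image_encoder` and `text_decoder` are placeholders for any pretrained
    modules; only `self.proj` has trainable parameters.
    """

    def __init__(self, image_encoder, text_decoder, img_dim, txt_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_decoder = text_decoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_decoder.parameters():
            p.requires_grad = False
        # The entire "glue": one linear projection from the image encoder's
        # feature space into the language model's embedding space.
        self.proj = nn.Linear(img_dim, txt_dim)

    def forward(self, images, caption_embeds):
        # images -> (batch, n_patches, img_dim), computed with gradients off
        with torch.no_grad():
            img_feats = self.image_encoder(images)
        prefix = self.proj(img_feats)  # (batch, n_patches, txt_dim)
        # Prepend the projected image features as "soft tokens" ahead of the
        # caption embeddings and let the frozen decoder model the caption.
        return self.text_decoder(torch.cat([prefix, caption_embeds], dim=1))
```

Training this amounts to fitting one matrix on captioning data, which is why the cost is negligible next to pretraining either model.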
I personally expect a similar phenomenon if you use the first half of an English-only pretrained language model and the second half of a Japanese-only pretrained language model: you might not literally be able to get away with a linear mapping as above, but I expect a quite cheap mapping would work. That said, I am not aware of anyone who has actually attempted this, so I could be wrong that the result from [2] will generalize that far.
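For the bilingual version, the “quite cheap mapping” might look something like the sketch below. This is purely hypothetical (as noted, nobody has run the experiment as far as I know); english_lm and japanese_lm are placeholder objects assumed to expose their transformer blocks as `.blocks`, and the embedding/unembedding layers are omitted for brevity.

```python
import torch.nn as nn

def glue_lms(english_lm, japanese_lm, d_en, d_ja):
    """Hypothetical stitch: the bottom half of an English-only LM feeding the
    top half of a Japanese-only LM through a small trainable adapter.

    Everything pretrained is frozen; only `adapter` would be trained, e.g. on
    a modest parallel corpus -- that is the "<1% of the compute" scenario.
    """
    bottom = list(english_lm.blocks)[: len(english_lm.blocks) // 2]
    top = list(japanese_lm.blocks)[len(japanese_lm.blocks) // 2:]

    # Freeze everything that came from the pretrained models.
    for block in bottom + top:
        for p in block.parameters():
            p.requires_grad = False

    # The "quite cheap mapping": maybe a bare nn.Linear(d_en, d_ja) suffices,
    # maybe it needs a small MLP like this one -- that is the open question.
    adapter = nn.Sequential(nn.Linear(d_en, d_ja), nn.GELU(), nn.Linear(d_ja, d_ja))

    class GluedLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.bottom = nn.ModuleList(bottom)
            self.adapter = adapter
            self.top = nn.ModuleList(top)

        def forward(self, h):               # h: (batch, seq, d_en) hidden states
            for block in self.bottom:       # English "reader" half
                h = block(h)
            h = self.adapter(h)             # map into the Japanese model's space
            for block in self.top:          # Japanese "writer" half
                h = block(h)
            return h                        # (batch, seq, d_ja)

    return GluedLM()
```

Whether a frozen stitch like this actually translates well is exactly the untested empirical question; the sketch is only meant to show how little new machinery the claim requires.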
(Aside: was that a typo, or did you intend to say “compute” instead of “computing power”?)
Yeah, I did mean “compute” there. I think it’s just a weird way that people in my industry use words.[3]
[1] Example: DeepMind’s Flamingo, which demonstrated that it was possible at all to take a pretrained language model and a pretrained vision model and glue them together into a multimodal model, and that doing so produced SOTA results on a number of benchmarks. See also this paper, also out of DeepMind.
[2] Per Linearly Mapping from Image to Text Space.
[3] For example, see this HN discussion about it. See also the “compute” section of this post, which talks about things that are “compute-bound” rather than “bounded on the amount of available computing power”.
Why waste time use lot word when few word do trick?