That’s a fair criticism, but why would it apply to only language models? We also train visual models with a randomized curriculum, and we seem to get much better results. Why would randomization hurt training efficiency for language generation but not image generation?
First, when we say “language model” and then we talk about the capabilities of that model for “standard question answering and factual recall tasks”, I worry that we’ve accidentally moved the goal posts on what a “language model” is.
Originally, language models were stochastic parrots. They were developed to answer questions like “given these words, what comes next?” or “given this sentence, with this unreadable word, what is the most likely candidate?” or “what are the most common words?”[1] This was not a problem that required deep learning.
Then, we applied deep learning to it, because the path of history so far has been to take straightforward algorithms, replace them with a neural network, and see what happens. From that, we got … stochastic parrots! Randomizing the data makes perfect sense for that.
Then, we scaled it. And we scaled it more. And we scaled it more.
And now we’ve arrived at a thing we keep calling a “language model” due to history, but it isn’t a stochastic parrot anymore.
Second, I’m not saying “don’t randomize data”; I’m saying “use a tiered approach to training”. We would use all of the same techniques: randomization, masking, adversarial splits, etc. What we would not do is throw all of our data and all of our parameters into a single, monolithic model and expect that to be efficient.[2] Instead, we’d first train a “minimal” LLM, then we’d use that LLM as a component within a larger NN, and we’d train that combined system (LLM + NN) on all of the test cases we care about for abstract reasoning / problem solving / planning / etc. It’s that combined system that I think would end up being vastly more efficient than current language models, because I suspect the majority of language model parameters are being used for embedding trivia that doesn’t contribute to the core capabilities we recognize as “general intelligence”.
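To make the tiered idea concrete, here is a minimal sketch of the shape of the thing, assuming PyTorch. Every name and dimension here (MinimalLM, CombinedSystem, the layer counts) is a placeholder, not a real proposal:

```python
import torch
import torch.nn as nn

# Stage 1: a "minimal" LLM, pretrained only on next-token prediction.
# (A placeholder module standing in for whatever small model you'd actually use.)
class MinimalLM(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))  # (batch, seq, d_model)

# Stage 2: the combined system. The LLM is frozen and used as a component;
# only the surrounding network is trained on the reasoning / planning tasks.
class CombinedSystem(nn.Module):
    def __init__(self, llm, d_model=512, num_answers=4):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():
            p.requires_grad = False  # keep the language component fixed
        self.reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(d_model, num_answers)

    def forward(self, token_ids):
        h = self.llm(token_ids)
        h = self.reasoner(h)
        return self.head(h.mean(dim=1))  # e.g. multiple-choice answer logits

llm = MinimalLM()            # assume this was already pretrained in stage 1
system = CombinedSystem(llm)
optimizer = torch.optim.AdamW(
    (p for p in system.parameters() if p.requires_grad), lr=1e-4
)
# ...then train `system` on the reasoning / problem-solving / planning test
# cases, with all the usual tricks (shuffling, masking, adversarial splits).
```

The point is only the decoupling: the language component gets its parameters from plain next-token prediction once, and the parameters we spend on reasoning / planning live outside it.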
This wasn’t for auto-complete; it was generally for things like automatic text transcription from images, audio, or videos. Spam detection was another use case.
Recall that I’m trying to offer a hypothesis for why a system like GPT-3.5 takes so much training and so many parameters and yet still isn’t “competent” in all of the ways that a human is competent. I think “it is being trained in an inefficient way” is a reasonable answer to that question.
Okay, that’s all fair, but it still doesn’t answer my question. We don’t do any of these things for diffusion models that output images, and yet these diffusion models manage to be much smaller than models that output words, while maintaining an even higher level of output quality. What is it about words that makes the task different?
Or are you suggesting that image generators could also be greatly improved by training minimal models, and then embedding those models within larger networks?
We don’t do any of these things for diffusion models that output images, and yet these diffusion models manage to be much smaller than models that output words, while maintaining an even higher level of output quality. What is it about words that makes the task different?
I’m not sure that “even higher level of output quality” is actually true, but I recognize that it can be difficult to judge when an image generation model has succeeded. In particular, I think current image models are fairly bad at specifics in much the same way as early language models.
But I think the real problem is that we seem to still be stuck on “words”. When I ask GPT-4 a logic question, and it produces a grammatically correct sentence that answers the logic puzzle correctly, only part of that is related to “words”—the other part is a nebulous blob of reasoning.
I went all the way back to GPT-1 (117 million parameters) and tested next-word prediction: I gave it a bunch of prompts and looked only at whether the very next word was what I would have expected. I think it’s incredibly good at that! Probably better than most humans.
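For anyone who wants to run the same kind of spot check, something like this works against the 117M-parameter GPT-1 checkpoint that HuggingFace hosts as openai-gpt (the prompt below is just a placeholder, not one of the prompts I actually used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "openai-gpt" is the original 117M-parameter GPT-1 checkpoint on HuggingFace.
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
model = AutoModelForCausalLM.from_pretrained("openai-gpt")
model.eval()

prompt = "the cat sat on the"              # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, vocab_size)

next_id = logits[0, -1].argmax().item()    # greedy pick for the next position
print(tokenizer.decode([next_id]))         # the model's single most likely next token
```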
Or are you suggesting that image generators could also be greatly improved by training minimal models, and then embedding those models within larger networks?
No, because this is already how image generators work. That’s what I said in my first post when I noted the architectural differences between image generators and language models. An image generator, as a system, consists of multiple models. There is a text → image-space model, and then an image-space → image model. The text → image-space encoder is generally trained first, and then it’s normally frozen during the training of the image decoder.[1] Meanwhile, the image decoder is trained on a straightforward task: “given this image, predict the noise that was added”. In the actual system, that decoder is put into a loop to generate the final result. I’m requoting the relevant section of my first post below:
The reason why I’m discussing the network in the language of instructions, stack space, and loops is because I disagree with a blanket statement like “scale is all you need”. I think it’s obvious that scaling the neural network is a patch on the first two constraints, and scaling the training data is a patch on the third constraint.
This is also why I think that point #3 is relevant. If GPT-3 does so well because it’s using the sea of parameters for unrolled loops, then something like Stable Diffusion at 1/200th the size probably makes sense.
Refer to figure 2 in https://cdn.openai.com/papers/dall-e-2.pdf. Or read this:
The trick here is that they decoupled the encoding from training the diffusion model. That way, the autoencoder can be trained to get the best image representation and then downstream several diffusion models can be trained on the so-called latent representation.
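To spell out the shape of that decoupling, here is a toy sketch (not the actual Stable Diffusion code; every module is a stand-in, and the text conditioning from the full system is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy placeholder for the piece trained in an *earlier* stage: an image
# autoencoder whose encoder maps images into a small latent space.
latent_encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)  # image -> latent, stand-in for a real VAE encoder
for p in latent_encoder.parameters():
    p.requires_grad = False                                 # frozen: not trained in this stage

# The only thing trained in this stage: a denoiser (a U-Net in the real
# system, a single conv layer here) that predicts the noise that was added.
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def training_step(images):
    with torch.no_grad():
        latents = latent_encoder(images)          # frozen representation
    noise = torch.randn_like(latents)
    noisy_latents = latents + noise               # a real diffusion schedule scales both terms
    predicted_noise = denoiser(noisy_latents)
    loss = F.mse_loss(predicted_noise, noise)     # "given this image, predict the noise that was added"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one step on a random batch standing in for real images.
training_step(torch.randn(2, 3, 64, 64))
# At inference time the trained denoiser gets put into a loop: start from
# random latents, repeatedly subtract the predicted noise, then decode.
```

The thing to notice is which parameters receive gradients: only the denoiser, never the representation it operates on.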
This is the idea that I’m saying could be applied to language models, or rather, to a system in which we want to demonstrate “general intelligence” in the form of reasoning / problem solving / Q&A / planning / etc. First train an LLM, then train a larger system with that LLM as a component within it.