SoerenMind comments on Inference cost limits the impact of ever larger models

SoerenMind 5 Nov 2021 11:55 UTC
2 points
Thanks for elaborating. I think I know what you mean now. I missed this:

I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer’s computation.

My original claim was that Zero-infinity has higher latency compared to pipelining across many layers of GPUs so that you don’t have to repeatedly load weights from RAM. But as you pointed out, Zero-infinity may avoid the additional latency by loading the next layer’s weights from RAM at the same time as computing the previous layer’s output. This helps IF loading the weights is at least as fast as computing the outputs. If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
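
To make the overlap condition concrete, here is a minimal sketch of a latency model. It assumes a single serialized link for weight loads, loads that never wait on activations, and enough GPU memory to buffer prefetched weights. The function names and per-layer timings are illustrative, not measurements or anything from ZeRO-Infinity itself.

```python
def sequential_latency(load_s, compute_s):
    """Load each layer's weights, then compute it, with no overlap."""
    return sum(load_s) + sum(compute_s)


def prefetched_latency(load_s, compute_s):
    """Overlap weight loading with computation of earlier layers.

    Assumptions (hypothetical model): loads run back to back on one link,
    and layer i can start computing once its weights have arrived and
    layer i-1 has finished.
    """
    load_done = 0.0     # time at which the current layer's weights arrive
    compute_done = 0.0  # time at which the current layer's output is ready
    for load, compute in zip(load_s, compute_s):
        load_done += load
        compute_done = max(compute_done, load_done) + compute
    return compute_done


# Four identical layers: 1 s to load, 2 s to compute (made-up numbers).
print(sequential_latency([1.0] * 4, [2.0] * 4))
print(prefetched_latency([1.0] * 4, [2.0] * 4))
```

When loading is no slower than computing, only the first load stays on the critical path. When loading is slower, the total becomes load-bound: every load plus the final layer's compute.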

My original claim was therefore misconceived. I’ll revise it to a different claim: bigger neural nets ought to have higher inference latency in general—regardless of whether we use Zero-infinity or not. As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency. However, adding more layers increases latency, and it’s hard to compensate with other forms of parallelism. (Width-wise parallelism could help but its communication cost scales unfavorably. It grows as we grow the NN’s width, and then again when we try to reduce latency by reducing the number of neurons per GPU [edit: it’s not quadratic, I was thinking of the parameter count].) Does that seem right to you?

The consequence then would be that inference latency (if not inference cost) becomes a constraint as we grow NNs, at least for applications where latency matters.
- gwern 5 Nov 2021 14:22 UTC
  6 points
  0
  Parent
  
  Width-wise parallelism could help but its communication cost scales unfavorably. It grows quadratically as we grow the NN’s width, and then quadratically again when we try to reduce latency by reducing the number of neurons per GPU.
  
  Incidentally, the latency cost of width vs depth is something I’ve thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs.* (Why was Gato 1 so small and so unimpressive in many ways? Well, they kept it small because they wanted it to run in realtime for a real robot. A much wider Transformer could’ve still met the deadline… but cost a lot more parameters and training than usual by going off the optimal scaling curves.) It does not matter how many watts or neurons you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you’ve been eaten by a stupider but faster-thinking predator.
  
  So a biological brain might be forced to be deep into an unfavorable point on width vs depth—which might be extremely expensive—in order to meet its subset of robotics-related deadlines, as it were.
  
  * With a striking counterexample, in both tininess of brain and largeness of latency, being Portia. What is particularly striking to me is not that it is so intelligent while being so tiny, but that this seems to be directly due to its particular ecological niche: there are very few creatures out there who need extremely flexible intelligent behavior but also are allowed to have minutes or hours to plan many of their actions… but Portia is one of them, as it is a stealthy predator attacking static prey. The prey also don’t generally have much memory nor can they just leave their web, so a Portia can try again if the first trick didn’t work. So Portia spiders are allowed to do things like spend hours circumnavigating a web to strike their prey spider from the right direction or gradually test out mimicry until they find the right cue to trick their prey spider. So it’s fascinating to see that in this highly unusual niche, it is possible to have a tiny biological brain execute extremely slow but intelligent strategies, and it suggests that if latency were not a problem, biological brains could be far more intelligent and we would not need to see such architecturally-huge biological brains to reach human-level performance, and then we would no longer have any paradox of why highly-optimized human brains seem to need so many parameters to do the same thing as tiny ANNs.
  What links here?
  - Noosphere89's comment on Book Review: Consciousness Explained (as the Great Catalyst) by Rafael Harth (20 Jan 2025 23:49 UTC; 4 points)
  - gwern 7 Feb 2025 21:37 UTC
    16 points
    1
    Parent
    Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as humans in order to run superhuman humanoid robots?
    
    One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite ‘easy’ in that sense.
    
    It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see ‘large’ robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b-parameters, the largest robotics are usually <10b, with Gato 1’s ~1b having been, if anything, unusually large because of all the other stuff it was doing. (I’m very behind on the robotics literature so maybe there are now much larger 100b-parameter models as they move into the ‘foundation model’ multi-modal/task scaling paradigm, but I’d bet that there still are none >1,000b.) Even sound/image/video generative models, which would be expected to be much larger than necessary for robotics tasks, are often small enough to run on a single consumer GPU, still. And these are usually trained with scaling laws now, so these are compute-optimal sizes and it is not just that they are wildly under-parameterized (the way almost all models were pre-2020).
    
    So, if robotics is intrinsically easy, but animal brains do not show this because of their latency requirements, which forces them into misleadingly expensive brains, the implication is that we can do robotics by lifting the limitations of biological brains, like being forced to learn in realtime, in the real world, one animal at a time, without any sharing.
    
    We should be able to train deep but small NNs in silico: turning all animal problems into Portia problems, if you will, pausing the simulation to let the NNs think & act for as long as necessary to plan the right action, and only then letting time flow to see what happens, and reset it to try again.
    
    We remove all burdens of wallclock time or caloric consumption or childhood development to train teacher models which are powerful general robotic controllers, and only then use these teacher-models to optimize low-latency controllers. The wider low-latency student models will be easier to train when they simply must imitate the teacher in a supervised-learning setting instead of RL from scratch, and so the size should be a lot better. (If nothing else, the student models can’t ‘die’ if they make a mistake like breaking a latency constraint, so this learning setting is way easier than an animal’s task.)
    
    On a related note, it is also striking how far down in size LLMs can be pushed. You can get good reasoning out of tiny billion-parameter LLMs trained hard enough on high-quality-enough data, and the ‘densifying experience curve’ is steady and rapid (halving period of ~4 months), so we can expect that at some point we may have superhuman reasoning LLMs in the billion or sub-billion parameter range… which are just very, very ignorant, perhaps even more ignorant than you or me, of all the real-world knowledge & text that a proper LLM has. We can’t train those from scratch, but we can train trillion-parameter LLMs to suck in all the text in the world, and then exhale training data for small fast cheap models.
    
    So it seems that Moravec’s Paradox remains undefeated: as difficult as we find the abstract intellectual capabilities like the process of doing math or reasoning, so difficult we struggle to even write them down to train LLMs on, so difficult to train on we need giant gigawatt datacenters to just get started, they are not intrinsically difficult and in the long run, do not require big expensive NNs.
    - jacob_cannell 14 Feb 2025 2:01 UTC
      9 points
      0
      Parent
      The effectiveness of weight sharing (and parameter compression in general) diminishes as you move the domain from physics (simple rules/patterns tiled over all of space/time) up to language/knowledge (downstream facts/knowledge that are far too costly to rederive from simulation).
      
      BNNs can’t really take advantage of weight sharing so much, so ANNs that are closer to physics should be much smaller parameter wise, for the same compute and capability. Which is what we observe for lower level sensor/motor modalities.
    - SoerenMind 6 Mar 2025 11:48 UTC
      2 points
      0
      Parent
      Good points here.
      
      Btw I sometimes think back to how your 3y old comments on this post have aged well.
    - Noosphere89 7 Feb 2025 21:47 UTC
      2 points
      0
      Parent
      It might be at this point just an underinvestment in robotics, compared to other AI.
      
      Admittedly, Gato didn’t have positive transfer, unlike all the other robotic elements.
- TLW 11 Nov 2021 3:38 UTC
  2 points
  Parent
  I am glad we were able to work out the matter!
  
  > If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
  
  Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does...
  
  (There are tricks here to an extent—such as compressing the model and decompressing it on-target—but they seldom save much. (And if they do, that just means your model is inefficient...))
  
  According to a random guy on the internet, GPT-3 is ~300GB compressed. PCIe gen4x16 is ~31.5GB/s. If you have 1s of latency, that means that you can only stream in ~31.5GB per card. (In addition to what’s already stored in RAM.)
  
  That being said, as far as I can tell it is—in theory—possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 pcie), with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the pcie lanes filled with gen4 SSDs.)
  
  (In practice I strongly suspect you’ll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.)
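  
  The numbers above can be checked with a few lines of arithmetic. This is a rough sketch that only uses the figures quoted in this comment (31.5GB/s per gen4 x16 link, ~300GB for GPT-3); the function and constant names are made up for illustration, and it ignores root-complex limits and GPU-to-GPU traffic.

  ```python
  PCIE_GEN4_X16_GBPS = 31.5    # GB/s per card, figure quoted above
  GPT3_COMPRESSED_GB = 300.0   # GB, figure quoted above


  def servable_gb(num_cards, vram_gb_per_card, latency_s,
                  link_gbps=PCIE_GEN4_X16_GBPS):
      """Model size reachable within latency_s: weights already resident
      in VRAM plus weights streamed over each card's PCIe link."""
      resident = num_cards * vram_gb_per_card
      streamed = num_cards * link_gbps * latency_s
      return resident + streamed


  # 8x 6GB cards with a 1 s budget, and 4x 12GB cards with a 2 s budget.
  print(servable_gb(8, 6, 1.0), servable_gb(4, 12, 2.0))
  # Aggregate streaming rate that the root complex has to sustain with 8 cards.
  print(8 * PCIE_GEN4_X16_GBPS)
  ```

  Both configurations come out at 300GB (48GB resident plus 252GB streamed), and the 8-card case pushes 252GB/s through the root complex, which is the "something silly like 250GB/s" figure.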
  
  (It’s a pity that there’s no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...)
  
  (Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
  
  > As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency.
  
  Indeed. And indeed, increases it, as you’re adding GPU-->GPU trips to the critical path.
  - SoerenMind 11 Nov 2021 18:42 UTC
    1 point
    Parent
    
    Beware bandwidth bottlenecks, as I mentioned in my original post.
    
    Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU only has to load one slice of the model then. Of course you’ll need more GPUs then but still not a crazy number as long as you use something like ZeRO-infinity.
    
    (Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
    
    Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you’re willing to use a small batch size.
    - TLW 14 Nov 2021 19:59 UTC
      1 point
      Parent
      Presumably bandwidth requirements can be reduced a lot through width-wise parallelism.
      Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of gen4 pcie) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of pcie lanes).
      Gen5 and gen6 PCIe in theory will double this and double this again—but on a multiyear cadence at best.
      Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing.
      Hence: beware bandwidth bottlenecks.
      - SoerenMind 15 Nov 2021 10:17 UTC
        2 points
        Parent
        My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.
        
        (As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)
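        
        As a back-of-the-envelope sketch of the scaling argument: if each machine contributes one PCIe link's worth of bandwidth, the machine count for a given latency target follows directly. The per-machine bandwidth and model size are the figures quoted earlier in the thread; the function name is hypothetical, and the model ignores the communication cost between machines.

        ```python
        import math


        def machines_needed(model_gb, target_latency_s, per_machine_gbps,
                            resident_gb_per_machine=0.0):
            """Machines required so that resident plus streamed weights
            cover the whole model within the latency target."""
            per_machine_gb = resident_gb_per_machine + per_machine_gbps * target_latency_s
            return math.ceil(model_gb / per_machine_gb)


        # ~300GB model, one 31.5GB/s link per machine, nothing resident in VRAM.
        print(machines_needed(300, 1.0, 31.5))   # 1 s latency target
        print(machines_needed(300, 0.1, 31.5))   # 0.1 s latency target
        ```

        Under these assumptions a 1 s target needs 10 machines and a 0.1 s target needs 96, so the count grows roughly in proportion to the inverse of the latency target.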