Etched.ai is a new DL ASIC startup. Their big idea (podcast interview) seems to be to burn specific Transformer models into the ASIC, and then make full use of the compute on each loaded parameter by running very large batches in parallel:
While we run a transformer model, we have this huge number of parameters. And each one of these parameters is a number. And to use that number, we take in a number from our input. We multiply them together, and we add them to a running total.
So every one of those parameters, in the case of GPT-3, 175 billion, is loaded from memory once and then used in the math operation once. It turns out that loading a thing from memory is way more expensive than doing the math. So how do we solve this problem? Well, we say that these weights are the same across one user or two users or four users or eight users. So you batch together a huge number of queries. Then we load in that weight once, and we use it 16, 32, 64 times.
And that’s one of the really interesting things that a transformer ASIC can do. You can have a much, much larger batch, not 64 but 2,500. So we’re able to go load that weight in once, pay the expensive price and then amortize that expensive price over a huge number of users, making inference much, much cheaper. And now while this sounds good in theory, this does mean that you have to run that model in a place where you can have a huge number of users all kind of grouped together. So I think that means inference will be centralized in much the same way as training.
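To make the weight-reuse point concrete, here is a minimal back-of-envelope sketch of the arithmetic-intensity argument; the bytes-per-weight, peak-flops, and bandwidth numbers are my own illustrative assumptions, not figures from Etched or the podcast.

```python
# Back-of-envelope sketch of the weight-reuse argument in the quote above.
# Bytes-per-weight and chip numbers are illustrative assumptions, not
# Etched's or NVIDIA's published figures.

def arithmetic_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Flops performed per byte of weight loaded from memory.

    Each weight loaded once does one multiply-accumulate (2 flops) per
    sequence in the batch, so reuse grows linearly with batch size.
    """
    return 2.0 * batch_size / bytes_per_weight

# Rough machine balance: peak flops divided by memory bandwidth.
PEAK_FLOPS = 1e15       # ~1 Pflop/s, H100-class (assumed)
MEM_BANDWIDTH = 3e12    # ~3 TB/s of HBM (assumed)
machine_balance = PEAK_FLOPS / MEM_BANDWIDTH  # ~333 flops per byte

for batch in (1, 8, 64, 2500):
    ai = arithmetic_intensity(batch)
    bound = "compute-bound" if ai >= machine_balance else "memory-bound"
    print(f"batch={batch:5d}: {ai:7.0f} flops/byte -> {bound}")
```

This ignores KV-cache and activation traffic, which also grow with batch size, so treat it as an upper bound on how much batching alone buys; but it shows why a batch of 64 is still memory-bound while a batch of 2,500 is not.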
The problem with etching specific models is scale. It costs around $1M to design a custom chip mask, so it needs to be amortized over tens or hundreds of thousands of chips to become profitable. But no company needs that many.
Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can infer 1e6 tokens/s. If you have 10M active users, then 100 chips can provide each user a token every 100ms, around 600 wpm.
Even OpenAI would only need hundreds, maybe thousands of chips. The solution is smaller-scale chip production. There are startups working on electron beam lithography, but I’m unaware of a vendor Etched could buy from right now.
EDIT: 3 trillion flops/token (similar to GPT-4) is 3e12, so that would be 100,000 chips. The scale is actually there.
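For concreteness, here is a small sketch reproducing both chip-count estimates above (the original 3e9 flops/token guess and the EDIT’s 3e12), using the comment’s own round numbers; the 10 tokens/s/user rate is the one implied by “a token every 100ms”.

```python
# Reproducing the chip-count arithmetic above under both flops/token
# assumptions; all inputs are the comment's own round numbers.

CHIP_FLOPS = 3e15              # assumed per-chip throughput ("as fast as an H100")
USERS = 10e6                   # 10M active users
TOKENS_PER_SEC_PER_USER = 10   # one token every 100ms, ~600 wpm

for flops_per_token in (3e9, 3e12):  # original guess vs. GPT-4-scale EDIT
    tokens_per_sec_per_chip = CHIP_FLOPS / flops_per_token
    total_tokens_per_sec = USERS * TOKENS_PER_SEC_PER_USER
    chips = total_tokens_per_sec / tokens_per_sec_per_chip
    print(f"{flops_per_token:.0e} flops/token -> "
          f"{tokens_per_sec_per_chip:.0e} tokens/s/chip, {chips:,.0f} chips")
```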
If you read through the podcast, which is the only material I could quickly find laying out the Etched paradigm in any kind of detail, their argument seems to be that they can improve the workflow and easily pay for a trivial $1m (which is what, a measly 20 H100 GPUs?); that, as AI eats the global white-collar economy, inference cost becomes the main limit and the main obstacle to justifying the training runs for even more powerful models (it does you little good to create GPT-5 if you can’t then inference it at a competitive cost); and that plenty of companies actually would need or buy such chips, with many finding it worthwhile to make their own by finetuning on a company-wide corpus (akin to BloombergGPT).
At current economics, it might not make sense, sure; but they are big believers in the future, and point to other ways to soak up that compute: tree search, specifically. (You may not need that many GPT-4 tokens, because of its inherent limitations, so burning it onto a chip to make it >100x cheaper doesn’t do you much good, but if you can figure out how to do MCTS to make it the equivalent of GPT-6 at the same net cost...)
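Purely as an illustration of what “soaking up compute with tree search” could look like, here is a toy beam-style sketch standing in for MCTS; `score_continuation` and `expand` are hypothetical stubs for calls to the burned-in model, and nothing here reflects Etched’s actual plans.

```python
# Toy illustration of spending cheap inference on search: expand several
# candidate continuations per step and keep only the best few (a beam-style
# stand-in for MCTS). The scorer and expander are hypothetical stubs for
# calls to the ASIC-hosted model.
import random

def score_continuation(tokens: list[str]) -> float:
    """Placeholder scorer; a real system would query the model's log-probs
    or a value head running on the ASIC."""
    return random.Random(" ".join(tokens)).random()

def expand(tokens: list[str], vocab: list[str]) -> list[list[str]]:
    """Placeholder expander; a real system would sample next tokens."""
    return [tokens + [tok] for tok in vocab]

def tree_search(prompt: list[str], vocab: list[str],
                depth: int = 4, beam: int = 3) -> list[str]:
    frontier = [prompt]
    for _ in range(depth):
        candidates = [child for tokens in frontier
                      for child in expand(tokens, vocab)]
        # Many model calls per emitted token: keep only the `beam` best.
        frontier = sorted(candidates, key=score_continuation, reverse=True)[:beam]
    return frontier[0]

print(tree_search(["The", "answer", "is"], vocab=["yes", "no", "maybe"]))
```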
I’m not sure how much I believe their proprietary simulations claiming such speedups, and I’d definitely be concerned about models changing so fast* that this doesn’t make any sense to do for the foreseeable future given all of the latencies involved (how useful would a GPT-2 ASIC be today, even if you could run it for free at literally $0/token?), so this strikes me as a very gutsy bet but one that could pay off—there are many DL hardware startups, but I don’t know of anyone else seriously pursuing the literally-make-a-NN-ASIC idea.
* right now, the models behind the big APIs like Claude or ChatGPT change fairly regularly. Obviously, you can’t really do that with an ASIC which has the weights burned in… so you would either have to be very sure you won’t want to update the model any time soon, or figure out some way to improve it anyway: pipelining models, perhaps, or leaving in unused transistors which can be WORMed to periodically add in ‘update layers’, akin to lightweight finetuning of individual layers. If you believe burned-in ASICs are the future, similar to Hinton’s ‘mortal nets’, this would be a very open and almost untouched area of research: how best to ‘work around’ an ASIC being inherently WORM.
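One way to picture those ‘update layers’: a minimal numpy sketch where the burned-in weight matrix is immutable and a small low-rank correction (in the spirit of LoRA-style adapters) is written once into a WORM-able region. The shapes, rank, and scaling here are arbitrary assumptions on my part, not anything Etched has described.

```python
# Toy numpy sketch of the footnote's "update layer" idea: the big weight
# matrix W is frozen (burned into silicon), while a small low-rank
# correction (A, B) sits in a writable/WORM-able region and can be written
# once, later, without touching W, in the spirit of LoRA-style lightweight
# finetuning. Shapes, rank, and scaling are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 1024, 1024, 8

W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # burned-in, immutable
A = np.zeros((d_in, rank))                              # empty until "WORMed"
B = np.zeros((rank, d_out))

def layer(x: np.ndarray) -> np.ndarray:
    """Frozen matmul plus the (initially zero) low-rank update path."""
    return x @ W + (x @ A) @ B

x = rng.standard_normal((4, d_in))
before = layer(x)

# One-time write of the correction weights (the WORM step).
A = rng.standard_normal((d_in, rank)) * 0.01
B = rng.standard_normal((rank, d_out)) * 0.01
after = layer(x)

print("max output change from the update layer:", np.abs(after - before).max())
```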
Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can infer 1e6 tokens/s. If you have 10M active users, then 100 chips can provide each user a token every 100ms, around 600 wpm.
These numbers seem wrong. I think inference flops per token for powerful models is closer to 1e12-1e13. (Roughly 2x the parameter count for dense models: one multiply and one add per parameter.)
More generally, I think expecting a similar amount of money spent on training as on inference is broadly reasonable. So, if a future powerful model is trained for $1 billion, then spending $1 million to design custom inference chips is fine (though I expect the design cost is higher than this in practice).
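As a rough sanity check on those numbers, here is a sketch using the standard ~2N flops/token (forward pass) and ~6ND training-flops approximations; the parameter count, token counts, and the 3x lifetime-inference multiplier are illustrative assumptions of mine.

```python
# Rough sanity check on the reply's numbers, using the standard dense-model
# approximations: ~2 * params flops per forward-pass token and
# ~6 * params * tokens flops for training. The parameter count, training
# tokens, and lifetime-inference multiplier are illustrative assumptions.

params = 1e12           # hypothetical 1T-parameter dense model
train_tokens = 2e13     # assumed training set, roughly Chinchilla-optimal

inference_flops_per_token = 2 * params        # ~2e12, inside the 1e12-1e13 range
training_flops = 6 * params * train_tokens    # ~1.2e26

# If the model serves ~3x its training tokens over its lifetime, inference
# compute roughly equals training compute (2N * 3D = 6ND), which is the
# "similar spend on training and inference" intuition.
lifetime_inference_tokens = 3 * train_tokens
inference_flops = inference_flops_per_token * lifetime_inference_tokens

print(f"{inference_flops_per_token:.1e} flops/token")
print(f"training: {training_flops:.1e} flops, lifetime inference: {inference_flops:.1e} flops")
```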
They appear to have launched ‘Sohu’, for LLaMA-3-70b: https://www.etched.com/announcing-etched