The problem with etching specific models is scale. It costs around $1M to design a custom chip mask, so it needs to be amortized over tens or hundreds of thousands of chips to become profitable. But no companies need that many.
Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can then infer 1e6 tokens/s. If you have 10M active users, then 100 chips can give each user 10 tokens/s, i.e. a token every 100ms, around 600 tokens per minute.
Even OpenAI would only need hundreds, maybe thousands of chips. The solution is smaller-scale chip production. There are startups working on electron beam lithography, but I’m unaware of a retailer Etched could buy from right now.
EDIT: 3 trillion flops/token (similar to GPT-4) is 3e12, so that would be 100,000 chips. The scale is actually there.
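For concreteness, the back-of-envelope above and in the EDIT can be sketched as follows (every figure here is one of the thread's assumptions, not a measurement):

```python
# Chip-count arithmetic from the parent comment and its EDIT.
# Assumptions: a chip matches an H100 at 3e15 flops/s, and 10M
# concurrent users are each served 10 tokens/s.

CHIP_FLOPS_PER_S = 3e15
USERS = 10_000_000
TOKENS_PER_USER_PER_S = 10   # one token every 100 ms, ~600 tokens/min

def chips_needed(flops_per_token: float) -> float:
    tokens_per_chip_per_s = CHIP_FLOPS_PER_S / flops_per_token
    total_tokens_per_s = USERS * TOKENS_PER_USER_PER_S
    return total_tokens_per_s / tokens_per_chip_per_s

print(chips_needed(3e9))    # small-model assumption: 100 chips
print(chips_needed(3e12))   # GPT-4-scale assumption: 100,000 chips
```

The conclusion flips entirely on the flops/token assumption: three orders of magnitude on the model size is three orders of magnitude on the chip count.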
If you read through the podcast, which is the only material I could quickly find laying out the Etched paradigm in any kind of detail, their argument seems to be that they can improve the workflow and easily pay for a trivial $1M (which is what, a measly 20 H100 GPUs?), and that, as AI eats the global white-collar economy, inference cost is the main limit and the main obstacle to justifying the training runs for even more powerful models (it does you little good to create GPT-5 if you can’t then inference it at a competitive cost), and so plenty of companies actually would need or buy such chips, and many would find it worthwhile to make their own by finetuning on a company-wide corpus (akin to BloombergGPT).
At current economics, it might not make sense, sure; but they are big believers in the future, and point to other ways to soak up that compute: tree search, specifically. (You may not need that many GPT-4 tokens, because of its inherent limitations, so burning it onto a chip to make it >100x cheaper doesn’t do you much good, but if you can figure out how to do MCTS to make it the equivalent of GPT-6 at the same net cost...)
I’m not sure how much I believe their proprietary simulations claiming such speedups, and I’d definitely be concerned about models changing so fast* that this doesn’t make any sense to do for the foreseeable future given all of the latencies involved (how useful would a GPT-2 ASIC be today, even if you could run it for free at literally $0/token?), so this strikes me as a very gutsy bet but one that could pay off—there are many DL hardware startups, but I don’t know of anyone else seriously pursuing the literally-make-a-NN-ASIC idea.
* right now, the models behind the big APIs like Claude or ChatGPT change fairly regularly. Obviously, you can’t really do that with an ASIC which has burned in the weights… so you would either have to be very sure you don’t want to update the model any time soon or you have to figure out some way to improve it, like pipelining models, perhaps, or maybe leaving in unused transistors which can be WORMed to periodically add in ‘update layers’ akin to lightweight finetuning of individual layers. If you believe burned-in ASICs are the future, similar to Hinton’s ‘mortal nets’, this would be a very open and almost untouched area of research: how to best ‘work around’ an ASIC being inherently WORM.
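A minimal toy sketch of the footnote's 'update layers' idea (purely illustrative, not anything Etched has described): a frozen base matrix stands in for the burned-in weights, while a small additive low-rank patch, akin to lightweight finetuning, could live in write-once spare capacity and be applied later without touching the base:

```python
import numpy as np

# Toy model of a WORM-friendly layer: base weights W are burned in and
# immutable (like an ASIC), while a cheap low-rank patch (A @ B) can be
# written once later to shift the layer's behavior.

rng = np.random.default_rng(0)
d = 64
W = rng.normal(size=(d, d))      # burned-in weights: frozen forever
W.setflags(write=False)          # emulate the write-once property

rank = 4                         # patch costs only 2*d*rank extra params
A = rng.normal(size=(d, rank)) * 0.01
B = rng.normal(size=(rank, d)) * 0.01

def layer(x, patched=True):
    out = x @ W.T
    if patched:
        out = out + x @ (A @ B).T   # additive update; base untouched
    return out

x = rng.normal(size=(1, d))
print(np.allclose(layer(x, patched=False), x @ W.T))  # True
```

The open question the footnote raises is exactly how much behavioral drift such write-once patches can absorb before you need a new mask anyway.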
Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can then infer 1e6 tokens/s. If you have 10M active users, then 100 chips can give each user 10 tokens/s, i.e. a token every 100ms, around 600 tokens per minute.
These numbers seem wrong. I think inference flops per token for powerful models is closer to 1e12-1e13. (On the order of the parameter count: roughly 2N flops/token for a dense N-parameter model.)
More generally, I think expecting a similar amount of money spent on training as on inference is broadly reasonable. So, if a future powerful model is trained for $1 billion, then spending $1 million to design custom inference chips is fine (though I expect the design cost is higher than this in practice).
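Under that parity assumption, the one-off design cost is a rounding error; a quick sketch with the hypothetical dollar figures from the comment above (none of these are real prices):

```python
# Amortization sketch for the training-vs-inference parity argument.
# Assumes a $1B training run matched by $1B of lifetime inference spend.

TRAINING_COST = 1e9
INFERENCE_SPEND = TRAINING_COST   # "similar amount of money" assumption

def design_overhead(design_cost: float) -> float:
    """Fraction of lifetime inference spend eaten by the one-off design cost."""
    return design_cost / INFERENCE_SPEND

for cost in (1e6, 1e7, 1e8):
    print(f"${cost:,.0f} design = {design_overhead(cost):.1%} of inference spend")
```

Even if the true design cost is 10-100x the quoted $1M, it stays in the low single-digit percentages of a billion-dollar inference budget.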
They appear to have launched ‘Sohu’, for LLaMA-3-70b: https://www.etched.com/announcing-etched