Basically, the two claims in the linked post (that LLM inference is compute bound, and that a modern Nvidia chip inferring LLaMa only achieves 30% utilization) seem extraordinarily unlikely to both be true: if inference were genuinely compute bound, FLOP utilization would be far higher than 30%.
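For what it's worth, here's a rough roofline-style back-of-envelope, using illustrative numbers I'm assuming (H100-class specs, a 70B-parameter LLaMa, batch size 1, fp16 weights), for why single-stream decode looks heavily memory-bandwidth bound rather than compute bound:

```python
# Back-of-envelope sketch only; all hardware/model numbers below are assumptions,
# not measurements from the linked post.

params = 70e9                # assumed LLaMa-70B parameter count
bytes_per_param = 2          # fp16 weights
hbm_bandwidth = 3.35e12      # bytes/s, assumed H100 SXM HBM3 bandwidth
peak_flops = 990e12          # FLOP/s, assumed dense fp16 peak

weight_bytes = params * bytes_per_param   # ~140 GB of weights streamed per decoded token
flops_per_token = 2 * params              # ~2 FLOPs per parameter per token (multiply + add)

t_memory = weight_bytes / hbm_bandwidth   # lower bound if memory-bandwidth bound
t_compute = flops_per_token / peak_flops  # lower bound if compute bound

print(f"memory-bound time per token:  {t_memory * 1e3:.1f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.2f} ms")
print(f"implied FLOP utilization at batch 1: {t_compute / t_memory:.1%}")
```

Under those assumptions the compute-bound time is hundreds of times smaller than the memory-bound time, so at small batch sizes you'd expect tiny FLOP utilization, not 30%, and certainly not a compute-bound regime.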
I was pretty confused about why they didn't focus on memory instead of FLOPs. Maybe it was just a bad bet? Who would have made that bet? IIRC, memory bandwidth has been widely known to be The Thing for at least five years.
Pretty sure you can improve effective memory speed tremendously if the order of access is known at design time. Weird that they apparently didn't do that?