Basically, the claims in the linked post that LLM inference is compute bound, and that a modern Nvidia chip only achieves 30% utilization when inferring LLaMa, seem extraordinarily unlikely to both be true.
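For concreteness, a rough roofline back-of-envelope (the A100-class peak FLOPs, HBM bandwidth, and 70B-parameter/16-bit figures here are my own assumptions, not numbers from the post): batch-1 decoding streams every weight once per token, so arithmetic intensity is around 1 FLOP/byte, far below the ~150 FLOP/byte ridge point, which means memory bound with single-digit-percent FLOP utilization rather than 30%.

```python
# Back-of-envelope roofline sketch, assuming A100-class numbers (~312 TFLOP/s BF16,
# ~2 TB/s HBM) and a ~70B-parameter model in 16-bit weights; all figures are rough.

peak_flops = 312e12              # peak dense BF16 FLOP/s (assumed)
mem_bw     = 2.0e12              # HBM bandwidth in bytes/s (assumed)
ridge      = peak_flops / mem_bw # FLOP/byte needed to be compute bound
print(f"ridge point ~{ridge:.0f} FLOP/byte")            # ~156

# Batch-1 decoding: every generated token streams all weights once.
params      = 70e9
bytes_moved = params * 2         # 16-bit weights
flops_tok   = params * 2         # ~2 FLOPs per weight (multiply-add)
intensity   = flops_tok / bytes_moved
print(f"arithmetic intensity ~{intensity:.1f} FLOP/byte")  # ~1, far below the ridge

# Best-case FLOP utilization if memory bandwidth is the binding constraint:
tok_per_s = mem_bw / bytes_moved
flop_util = (tok_per_s * flops_tok) / peak_flops
print(f"max FLOP utilization at batch 1 ~{flop_util:.1%}") # <1%, nowhere near 30%
```

Larger batch sizes raise the arithmetic intensity, but then you'd expect FLOP utilization well above 30%, which is why the two claims are hard to reconcile.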
I was pretty confused about why they didn’t focus on memory instead of FLOPs. Maybe it was just a bad bet? But who would have made that bet? IIRC, memory speed was widely known to be The Thing for at least five years.
Pretty sure you can improve memory speed tremendously if the order of access is known before design. Weird that they apparently didn’t do that?
Crypto ASICs fundamentally didn’t need memory bandwidth. Modern GPUs are basically memory-bandwidth ASICs already.