I would bet that ASICs will rule the roost in a few years and this is only the beginning.
They claim 500k tokens per second with Llama 70B.
Seems to be exactly what it looks like, an ASIC. Curious if this is somehow not what it looks like.
Electrical engineer here. I read the publicity statement, and from my point of view it is both (a) a major advance, if true, and (b) entirely plausible. When you switch from a programmable device (e.g. GPU) to a similarly sized special purpose ASIC, it is not unreasonable to pick up a factor of 10 to 50 in performance. The tradeoff is that the GPU can do many more things than the ASIC, and the ASIC takes years to design. They claim they started design in 2022, on a transformer-only device, on the theory that transformers were going to be popular. And boy, did they luck out. I don't know if other people can tell, but to me, that statement oozes with engineering glee. They're so happy!
I would love to see a technical paper on how they did it.
Of course they may be lying.
So there is an oblique claim that they might potentially offer 20x cheaper inference in a setup with unknown affordances. Can it run larger models, or use more context? Is generation latency reasonable and at which cost?
The claims of being "faster" and "500k tokens per second" are about throughput per black box with unspecified characteristics, so they are meaningless in isolation. You could correctly say exactly the same thing about "speed" for Llama-3 70B inference using giant black boxes powered by a sufficient number of Pentium 4s.
They claim that by specializing the chips for transformer workloads and removing the programmability of GPUs, they can fit an order of magnitude more compute FLOPs on the same size chip, which is plausible. But common wisdom is that LLMs are memory bandwidth limited. Model Bandwidth Utilization in inference workloads is often 60-80%, which would indicate that Nvidia's chips are reasonably well balanced in their ratio of bandwidth to compute, and that there isn't a ton of performance to be gained by just increasing compute. The Sohu chip reportedly has 144GB of HBM3E memory, the same type of memory as the Nvidia B200, with 0.75x as much memory capacity and bandwidth. Compared to the H100, the Sohu has 1.8x the memory capacity and bandwidth. They claim that performance is 20x that of an H100, which seems hard to believe based on the memory bandwidth. But in the Sohu post, they claim that it's a misconception that inference is memory bandwidth limited. If I'm understanding it correctly, increasing batch sizes reduces the bandwidth to compute ratio, so you can tune the bandwidth to compute ratio of the workload to match your hardware, but at the cost of latency. But maybe I'm missing something; if you have experience in this field please chime in on whether you think memory bandwidth will be a constraint.
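A quick roofline-style sketch of the batch-size argument. This is the standard decode-phase approximation, not anything from Etched's post, and the H100 numbers below are ballpark public specs (~989 TFLOPs dense FP16, ~3.35 TB/s HBM3 bandwidth), so treat the result as illustrative:

```python
# Hedged sketch: during batched decode, each weight is read from HBM once
# per step and reused in a multiply-add for every sequence in the batch,
# so arithmetic intensity is roughly 2 * batch / bytes_per_param FLOPs/byte.
def min_batch_for_compute_bound(peak_flops, bandwidth_bytes, bytes_per_param=2):
    ridge = peak_flops / bandwidth_bytes      # FLOPs/byte at the roofline "ridge"
    return ridge * bytes_per_param / 2        # batch size where compute catches up

# Ballpark H100 SXM specs (assumed, not from the Sohu post)
batch = min_batch_for_compute_bound(989e12, 3.35e12)
print(round(batch))  # ~295: below this batch, decode is bandwidth-bound
```

So under this toy model, an H100 serving fp16 weights only becomes compute-bound at a batch of a few hundred concurrent sequences, which is roughly the regime where "add more FLOPs" starts to pay off.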
Also, sdmat on reddit claims that MoE and long context length models require much more memory bandwidth, which would be bad for Sohu.
The compute die is on the TSMC 4nm process, same as the B200. Die size from the photo looks like it's at the reticle size limit, compared to the B200, which uses 2 dies at the reticle limit. So, even if the Sohu chips are memory bandwidth limited, they should be ~20-30% cheaper to produce in terms of $/memory bandwidth, and much more energy efficient than Nvidia's B200. However, they only support transformers (and there's a separate variant for MoE models), and if AI architectures shift, it would take Etched around 3 years to launch a new chip accommodating the new architecture. If this style of chip becomes dominant, it would create a degree of lock-in to the transformer architecture and make it more difficult to switch to new architectures.
Their website is just renders, not real photos, so I'm pretty sure they don't have chips made yet, and the performance numbers are theoretical and could be way off. But they just announced a $120M fundraise, so they should have enough funding to see this chip across the finish line. I made a market on Manifold on whether they will ship these within a year. I think I'm selling some of my Nvidia stock though.
The reason why Etched is less bandwidth limited is that they traded latency for throughput by batching prompts and completions together. GPUs could also do that, but they don't, to improve latency.
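The latency-for-throughput trade can be sketched with a toy decode-step model (my own simplification, using the same assumed ballpark H100 specs of ~3.35 TB/s and ~989 TFLOPs FP16; real serving stacks are far more complicated):

```python
# Toy model: a decode step costs max(time to stream the weights once,
# time to do the batched multiply-adds). Bigger batches amortize the
# weight read, raising throughput -- but once compute dominates, each
# extra sequence also lengthens the step, i.e. per-token latency.
def decode_step(params, batch, bw=3.35e12, flops=989e12, bytes_per_param=2):
    t_mem = params * bytes_per_param / bw      # stream all weights from HBM
    t_compute = 2 * params * batch / flops     # one MAC per param per sequence
    step = max(t_mem, t_compute)               # seconds per decode step
    return batch / step, step                  # (total tok/s, latency per token)

for b in (1, 64, 1024):
    tput, lat = decode_step(70e9, b)           # Llama-70B-sized dense model
    print(f"batch={b:5d}  {tput:10.0f} tok/s  {lat * 1e3:6.1f} ms/token")
```

Under these assumptions, batch 1 and batch 64 take about the same wall time per step (the weight stream dominates), so throughput grows almost for free; by batch 1024 the step is compute-bound and per-token latency has ballooned severalfold.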
Crypto asics fundamentally didn’t need memory bandwidth. Modern GPUs are basically memory bandwidth asics already.
Basically, the claims in the linked post that LLM inference is compute bound, and that a modern nvidia chip inferring LLaMa only achieves 30% utilization, seem extraordinarily unlikely to both be true.
I was pretty confused why they didn’t focus on memory instead of flops. Maybe it was just a bad bet? Who would have made that bet? IIRC, memory speed was widely known to be The Thing for at least five years.
Pretty sure you can improve on memory speed tremendously if the order-of-access is known before design. Weird they apparently didn't do that?
Is there anything useful we can learn from Crypto ASICs as to how this will play out? And specifically, how to actually bet on it?
I think the main way to bet is to find some equity and buy it. Might be hard to find.