Adrian Kelly comments on New fast transformer inference ASIC — Sohu by Etched

Adrian Kelly 1 Jul 2024 8:23 UTC
4 points
1
They claim that by specializing the chips for transformer workloads and removing the programmability of GPUs, they can fit an order of magnitude more compute (FLOPs) on the same size chip, which is plausible. But common wisdom is that LLMs are memory bandwidth limited. Model Bandwidth Utilization in inference workloads is often 60-80%, which would indicate that Nvidia’s chips are reasonably well balanced in their ratio of bandwidth to compute, and that there isn’t a ton of performance to be gained by just increasing compute. The Sohu chip reportedly has 144GB of HBM3E memory, the same type of memory as the Nvidia B200, with 0.75x as much memory capacity and bandwidth. Compared to the H100, the Sohu has 1.8x the memory capacity and bandwidth. They claim that performance is 20x that of an H100, which seems hard to believe based on the memory bandwidth. But in the Sohu post, they claim that it’s a misconception that inference is memory bandwidth limited. If I’m understanding it correctly, increasing the batch size reduces the ratio of bandwidth to compute that the workload needs, because the weights are read from memory once per step and reused for every sequence in the batch. So you can tune the bandwidth to compute ratio of the workload to match your hardware, but at the cost of latency. But maybe I’m missing something. If you have experience in this field, please chime in on whether you think memory bandwidth will be a constraint.
Also, sdmat on reddit claims that MoE and long context length models require much more memory bandwidth, which would be bad for Sohu.
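To make the batching argument concrete, here is a rough roofline-style sketch. It is my own back-of-the-envelope model, not anything from Etched or Nvidia. It assumes a dense model where each decode step reads every weight once and does about 2 FLOPs per weight per sequence, and it ignores KV cache reads and MoE routing entirely, which is exactly where the long context and MoE concern above would show up. The function names and the hardware numbers are hypothetical placeholders, not measured specs.

```python
def critical_batch_size(peak_flops, mem_bandwidth, bytes_per_param=2.0):
    """Batch size at which a decode step stops being bandwidth bound.

    Assumed model: per step the chip reads every weight once
    (bytes_per_param bytes each) and does ~2 FLOPs per weight per sequence.
    """
    return peak_flops / mem_bandwidth * bytes_per_param / 2.0


def decode_step_time(n_params, batch, peak_flops, mem_bandwidth,
                     bytes_per_param=2.0):
    """Seconds per decode step: the slower of weight streaming and math."""
    t_memory = n_params * bytes_per_param / mem_bandwidth
    t_compute = 2.0 * n_params * batch / peak_flops
    return max(t_memory, t_compute)


def tokens_per_second(n_params, batch, peak_flops, mem_bandwidth,
                      bytes_per_param=2.0):
    """Aggregate throughput across the whole batch (one token per sequence)."""
    step = decode_step_time(n_params, batch, peak_flops, mem_bandwidth,
                            bytes_per_param)
    return batch / step


if __name__ == "__main__":
    # Placeholder hardware, NOT real specs: a baseline chip, and a chip with
    # 10x the compute and 1.8x the bandwidth (mirroring the ratios discussed).
    chips = {
        "baseline": (1.0e15, 3.0e12),
        "10x compute, 1.8x bandwidth": (1.0e16, 5.4e12),
    }
    n_params = 70e9  # hypothetical 70B-parameter dense model, 2 bytes/weight
    for name, (flops, bw) in chips.items():
        print(name, "critical batch ~", round(critical_batch_size(flops, bw)))
        for batch in (1, 64, 512, 4096):
            tps = tokens_per_second(n_params, batch, flops, bw)
            print(f"  batch={batch:5d}  tokens/s ~ {tps:,.0f}")
```

Under these assumptions, throughput grows linearly with batch size until the critical batch and then flattens, and a chip with much more compute per byte of bandwidth only pulls ahead at batch sizes well above where the baseline has already saturated. Each individual step does not get faster from batching, so the extra throughput is paid for by waiting to fill larger batches.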
The compute die is on the TSMC 4nm process, same as the B200. Die size from the photo looks like it’s at the reticle size limit, compared to the B200, which uses 2 dies at the reticle limit. So, even if the Sohu chips are memory bandwidth limited, they should be ~20-30% cheaper to produce in terms of $/memory bandwidth, and much more energy efficient than Nvidia’s B200. However, they only support transformers (and there’s a separate variant for MoE models), and if AI architectures shift then it would take Etched around 3 years to be able to launch a new chip to accommodate the new architecture. If this style of chip becomes dominant, it would create a degree of lock-in to the transformer architecture and make it more difficult to switch to new architectures.
Their website is just renders, not real photos, so I’m pretty sure they don’t have chips made yet, and the performance numbers are theoretical and could be way off. But they just announced a $120M fundraise, so they should have enough funding to see this chip across the finish line. I made a market on Manifold on whether they will ship these within a year. I think I’m selling some of my Nvidia stock though.
- Tao Lin 3 Jul 2024 20:29 UTC
  2 points
  0
  Parent
  The reason Etched was less bandwidth limited is that they traded latency for throughput by batching prompts and completions together. GPUs could also do that, but they don’t, in order to keep latency low.