One 8xSohu server replaces 160 H100 GPUs.
Benchmarks are for Llama-3 70B in FP8 precision, 2048 input/128 output lengths.
What would happen if AI models get 20x faster and cheaper overnight?
So there is an oblique claim that they might offer 20x cheaper inference, in a setup with unknown affordances. Can it run larger models, or use more context? Is generation latency reasonable, and at what cost?
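For concreteness, here is the arithmetic behind the implied 20x, which holds only under the unstated assumption that a Sohu chip is roughly comparable to an H100 in cost and power (the announcement establishes none of this; the figures below are the vendor's own claims):

```python
# Back-of-the-envelope for the implied "20x": one 8xSohu server is claimed
# to replace 160 H100s. These are vendor claims, not measurements.
h100s_replaced = 160
sohu_chips_per_server = 8
hardware_ratio = h100s_replaced / sohu_chips_per_server
print(f"Implied reduction in accelerator count: {hardware_ratio:.0f}x")
# -> 20x, which only translates into "20x cheaper" if per-chip cost,
#    power draw, and utilization are actually comparable to an H100's.
```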
The claims of being “faster” and of “500k tokens per second” are about throughput per black box with unspecified characteristics, so they are meaningless in isolation. You could correctly say exactly the same thing about the “speed” of Llama-3 70B inference using giant black boxes powered by a sufficient number of Pentium 4s.
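A minimal sketch of that point, using a made-up Pentium 4 throughput figure purely for illustration:

```python
def box_throughput(units: int, tokens_per_sec_per_unit: float) -> float:
    """Aggregate throughput of a 'black box' containing many identical units."""
    return units * tokens_per_sec_per_unit

# Hypothetical: a Pentium 4 grinding through Llama-3 70B at 0.01 tokens/sec
# (an illustrative number, not a measurement).
p4_rate = 0.01
units = 50_000_000
print(f"{units:,} Pentium 4s: {box_throughput(units, p4_rate):,.0f} tokens/sec")
# -> 500,000 tokens/sec in aggregate, yet each individual request still
#    generates at 0.01 tokens/sec; a throughput number alone says nothing
#    about latency or cost.
```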