Specifically, their claim is “2x faster, half the price, and has 5x higher rate limits”. For voice, it’s as little as “232 milliseconds, with an average of 320 milliseconds”, down from averages of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). I think there are people with API access who are validating this claim on their workloads, so more data should trickle in soon. But I didn’t like seeing Whisper v3 being compared to 16-shot GPT-4o; that’s not a fair comparison for WER, and I hope it doesn’t catch on.
If you want to try it yourself you can use ELAN, which is the tool used in the paper they cite for human response times. I think if you actually ran this test, you would find a lot of inconsistency, with large differences between min and max response time. An average hides a lot compared to a latency profile generated by e.g. HdrHistogram. Auditory signals reach central processing systems within 8-10ms, while visual stimuli can take around 20-40ms, so against a 320ms average there’s still room for 1-2 OOM of latency improvement.
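To make the point about averages concrete, here is a minimal sketch of a latency profile in plain Python. It is not HdrHistogram itself, just a nearest-rank percentile summary standing in for it; the function names and the sample values are hypothetical and are not measurements of any real system.

```python
import math
from statistics import mean


def percentile(sorted_samples, p):
    """Nearest-rank percentile (p in 0..100) of an already sorted list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def latency_profile(samples_ms, percentiles=(50, 90, 99, 99.9)):
    """Summarize response times as min/mean/max plus selected percentiles."""
    ordered = sorted(samples_ms)
    profile = {"min": ordered[0], "mean": mean(ordered), "max": ordered[-1]}
    for p in percentiles:
        profile[f"p{p}"] = percentile(ordered, p)
    return profile


# Hypothetical samples (ms): mostly fast, a few slow, one very slow outlier.
samples = [250] * 90 + [400] * 9 + [3000]
profile = latency_profile(samples)
for name, value in profile.items():
    print(f"{name:>6}: {value} ms")
```

With a distribution like this the mean sits close to the typical response, while the upper percentiles and the max show the tail that a single average hides.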
LLM inference is not as well studied as training, so there’s lots of low-hanging fruit when it comes to optimization: at first it is bottlenecked on memory bandwidth, and then, post quantization, on throughput and compute within acceptable latency envelopes. Plus there’s a lot of pressure to squeeze out extra efficiency given constraints on hardware.
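A rough back-of-the-envelope sketch of the memory bandwidth point: if every weight has to be read from memory once per generated token, single-stream decode speed is capped at bandwidth divided by model size in bytes. This ignores KV cache traffic, batching and compute entirely, and the parameter count and bandwidth below are illustrative assumptions, not figures for any particular model or GPU.

```python
def max_decode_tokens_per_s(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on tokens/s when all weights are read once per token."""
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


# Assumed for illustration: 7e9 parameters, 1e12 bytes/s of memory bandwidth.
N_PARAMS = 7e9
BANDWIDTH = 1e12

fp16_bound = max_decode_tokens_per_s(N_PARAMS, 2.0, BANDWIDTH)  # 16-bit weights
int4_bound = max_decode_tokens_per_s(N_PARAMS, 0.5, BANDWIDTH)  # 4-bit weights

print(f"fp16 bound: {fp16_bound:.0f} tokens/s")
print(f"int4 bound: {int4_bound:.0f} tokens/s")
```

Under these assumptions, shrinking the weights by 4x raises the bandwidth ceiling by 4x, which is why the bottleneck can shift toward compute and throughput once a model is quantized.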
Llama-2 came out in July 2023, and by September there were so many articles coming out on inference tricks that I created a subreddit to keep track of the high-quality ones, though I gave up by November. At least some of the improvement is from open source code making it back into the major labs. The trademark for GPT-5 was registered in July (and included references to audio being built in) and updated in February, and in March they filed to use “Voice Engine”, which seems about right for a training run. I’m not aware of any publicly available evidence which contradicts the hypothesis that GPT-5 would just be a scaled-up version of this architecture.