OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Link post
Until now, ChatGPT handled audio through a pipeline of three models: audio transcription, then GPT-4, then text-to-speech. GPT-4o is apparently trained on text, voice and vision together, so everything is done natively. You can now interrupt it mid-sentence.
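The difference can be sketched with placeholder functions (all names below are illustrative stand-ins, not real OpenAI APIs): the old path chains three models, so anything not captured in text — tone, pauses, overlapping speech — is lost at each boundary, while a native model maps audio to audio directly.

```python
# Illustrative sketch only: every function here is a hypothetical stand-in,
# not a real OpenAI API.

def transcribe(audio: bytes) -> str:
    """Stage 1: a Whisper-style speech-to-text model."""
    return "transcribed text"

def chat(prompt: str) -> str:
    """Stage 2: a GPT-4-style text-only model."""
    return f"reply to: {prompt}"

def synthesize(text: str) -> bytes:
    """Stage 3: a text-to-speech model."""
    return text.encode("utf-8")

def old_pipeline(audio: bytes) -> bytes:
    """Three hops; tone, laughter and interruptions vanish at each text boundary."""
    return synthesize(chat(transcribe(audio)))

def native_model(audio: bytes) -> bytes:
    """GPT-4o reportedly replaces the chain with one model trained end to end."""
    return b"audio reply"
```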
It has GPT-4-level intelligence according to benchmarks. 16-shot GPT-4o is somewhat better at transcription than Whisper (that's a weird comparison to make), and 1-shot GPT-4o is considerably better at vision than previous models.
It's also somehow been made significantly faster at inference time. This might be mainly driven by an improved tokenizer. Edit: nope, the English tokenizer is only about 1.1x more efficient.
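A quick back-of-the-envelope check of that edit: if the new tokenizer packs English text into only ~1.1x fewer tokens, autoregressive decoding takes ~1.1x fewer steps, which is nowhere near enough to account for the speedup on its own.

```python
# If the English tokenizer is ~1.1x more compact, generation needs ~1.1x
# fewer decode steps, i.e. only ~9% less decoding time, all else equal.
compression = 1.1
time_ratio = 1 / compression  # relative decode time vs. the old tokenizer
print(f"{1 - time_ratio:.0%} fewer decode steps")  # → "9% fewer decode steps"
```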
It's confirmed it was the “gpt2” model seen on the LMSys arena these past weeks, a marketing move. It has the highest Elo as of now.
They'll be gradually releasing it to everyone, even free users.
Safety-wise, they claim to have run it through their Preparedness framework and a red team of external experts, but have published no reports on this. “For now”, audio output is limited to a selection of preset voices (addressing the risk of audio impersonation).
The demos during the livestream still seemed a bit clunky in my opinion, and far from integrating naturally into normal human conversation, which is what they're moving towards.
No competitor to Google Search was announced, as had been rumored.