OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Link post
Until now, ChatGPT handled audio through a pipeline of three models: audio transcription, then GPT-4, then text-to-speech. GPT-4o is apparently trained on text, voice and vision together, so everything is done natively. You can now interrupt it mid-sentence.
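The difference can be sketched with placeholder functions (all names below are illustrative stand-ins, not real OpenAI APIs): the old path chains three models, so anything not captured in text — tone, pauses, overlapping speech — is lost at each boundary, while a native model maps audio to audio directly.

```python
# Illustrative sketch only: every function here is a hypothetical stand-in,
# not a real OpenAI API.

def transcribe(audio: bytes) -> str:
    """Stage 1: a Whisper-style speech-to-text model."""
    return "transcribed text"

def chat(prompt: str) -> str:
    """Stage 2: a GPT-4-style text-only model."""
    return f"reply to: {prompt}"

def synthesize(text: str) -> bytes:
    """Stage 3: a text-to-speech model."""
    return text.encode("utf-8")

def old_pipeline(audio: bytes) -> bytes:
    """Three hops; tone, laughter and interruptions vanish at each text boundary."""
    return synthesize(chat(transcribe(audio)))

def native_model(audio: bytes) -> bytes:
    """GPT-4o reportedly replaces the chain with one model trained end to end."""
    return b"audio reply"
```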
It has GPT-4-level intelligence according to benchmarks. 16-shot GPT-4o is somewhat better at transcription than Whisper (that's a weird comparison to make), and 1-shot GPT-4o is considerably better at vision than previous models.
It's also somehow been made significantly faster at inference time. This might be mainly driven by an improved tokenizer. Edit: nope, the English tokenizer is only about 1.1x more efficient.
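A quick back-of-the-envelope check of that edit: if the new tokenizer packs English text into only ~1.1x fewer tokens, autoregressive decoding takes ~1.1x fewer steps, which is nowhere near enough to account for the speedup on its own.

```python
# If the English tokenizer is ~1.1x more compact, generation needs ~1.1x
# fewer decode steps, i.e. only ~9% less decoding time, all else equal.
compression = 1.1
time_ratio = 1 / compression  # relative decode time vs. the old tokenizer
print(f"{1 - time_ratio:.0%} fewer decode steps")  # → "9% fewer decode steps"
```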
It's confirmed it was the “gpt2” model seen on the LMSys arena these past weeks, a marketing move. It has the highest Elo as of now.
They'll be gradually releasing it to everyone, even free users.
Safety-wise, they claim to have run it through their Preparedness framework and a red team of external experts, but have published no reports on this. “For now”, audio output is limited to a selection of preset voices (addressing the risk of audio impersonation).
The demos during the livestream still seemed a bit clunky in my opinion, and far from integrating naturally into normal human conversation, which is what they're moving towards.
No competitor to Google Search was announced, as had been rumored.