GPT-4o is literally cheaper.
And you’re probably misjudging it from text-only outputs alone. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like there may be very deep integration of SSML.
One way to bypass word-problem variation errors in older text-only models was token replacement with symbolic representations. In general, we’re probably at a level of complexity where breaking from training-data similarity at the token level, versus having prompts match context at the concept level (like in this paper), is going to lead to significantly improved expressed performance.
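The token-replacement trick above can be sketched roughly like this (a minimal illustration of the idea; the regex and the `x1`, `x2` symbol scheme are my own choices, not from any particular paper):

```python
import re

def symbolize(problem: str) -> tuple[str, dict[str, str]]:
    """Replace concrete numbers in a word problem with symbolic
    placeholders (x1, x2, ...) so the prompt no longer pattern-matches
    memorized training examples token-for-token."""
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        sym = f"x{len(mapping) + 1}"
        mapping[sym] = match.group(0)  # remember the original value
        return sym

    symbolic = re.sub(r"\d+(?:\.\d+)?", repl, problem)
    return symbolic, mapping

symbolic, mapping = symbolize(
    "Alice has 12 apples and gives 5 to Bob. How many remain?"
)
# symbolic == "Alice has x1 apples and gives x2 to Bob. How many remain?"
# mapping  == {"x1": "12", "x2": "5"}
```

The model solves the symbolic version, and the answer is mapped back to the concrete values afterwards, which forces reasoning over structure rather than recall of a near-identical training example.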
I would strongly suggest not evaluating GPT-4o’s overall performance in text-only mode without the SSML markup added.
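To make that concrete, wrapping a prompt might look something like this. The tags are standard W3C SSML, but whether and how GPT-4o’s audio mode actually consumes markup like this is my speculation, not documented behavior:

```python
def with_ssml(text: str) -> str:
    """Wrap a plain-text prompt in minimal SSML so prosody and emphasis
    cues travel alongside the text. Tag names follow the W3C SSML spec;
    the specific rate/level values here are arbitrary examples."""
    return (
        "<speak>"
        '<prosody rate="medium">'
        f'<emphasis level="moderate">{text}</emphasis>'
        "</prosody>"
        "</speak>"
    )

print(with_ssml("Walk me through the problem step by step."))
```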
Opus is great; I like that model a lot. But in general I think most of the people looking at this right now are too focused on what’s happening with the networks themselves and not focused enough on what’s happening with the data, particularly around the clustering of features across multiple dimensions of the vector space. SAEs (sparse autoencoders) are clearly picking up only a small sample, and even then aren’t cleanly discovering precisely what’s represented.
I’d wait to see what ends up happening with things like CoT in SSML synthetic data.
The current Gemini search summarization failures, as well as an unexpected result the other week with humans around a theory-of-mind variation, suggest to me that the more models lean into effectively surface statistics for token similarity, versus completion based on feature clustering, the more performance is held back, and that cutting through the similarity with formatting differences will lead to a performance leap. This may even be part of why models will frequently get a problem right as a code expression but not as a direct answer.
So even if GPT-5 doesn’t arrive, I’d happily bet that we see a very noticeable improvement over the next six months, and that’s not even accounting for additional efficiency in prompt techniques. But all that said, I’d also be surprised if we don’t at least see GPT-5 announced by then.
P.S. Lmsys is arguably the best leaderboard for evaluating real-world usage, but it still inherently reflects a sampling bias around what the people who visit lmsys ask of models, as well as the ways in which they ask. I wouldn’t extrapolate relative performance too far, particularly when differences are minor.
I’m reminded of a quote I love from an apocrypha that goes roughly like this:
Q: How long will suffering rule over humans?
A: As long as women bear children.
Also, there’s the possibility you are already in a digital resurrection of humanity, and thus, if you’re worried about s-risks from AI, death wouldn’t necessarily be an escape but an acceleration. So the wisest option would be to maximize your time while suffering is low, since inescapable eternal torture could be just around the corner once these precious moments pass you by (and you wouldn’t want to waste them stressing about tomorrow during the limited number of todays you have).
But on an individualized basis, even if AI weren’t a concern, everyone faces significant s-risks toward the end of life. An accident could put any person into a situation where, unless they have the proper directives, they could spend years suffering well beyond most people’s expectations. So if extended suffering is a concern, do look into that paperwork (the doctors I know cry most not over the healthy who get sick but over the unhealthy kept alive by well-meaning but misguided family).
I would argue that there’s a very, very low chance of an original human being kept meaningfully alive to torture for eternity, though. And there’s a degree of delusion of grandeur in thinking an average person would have the insane resources necessary to extend life indefinitely spent on them just to torture them.
There are probably better things to worry about, and even then there are probably better things to do than worry with the limited time you do have in a non-eternal existence.