The discourse around this model would benefit a lot from (a greater number of) specific examples where the GPT-4.5 response is markedly and interestingly different from the response of some reference model.
Karpathy’s comparisons are a case in point (of the absence I’m referring to). Yes, people are vehemently disputing which responses were better, and whether the other side has “bad taste”… but if you didn’t know what the context was, the most obvious property of the pairs would be how similar they are.
And how both options are bad (unfunny standup, unmetrical or childish poetry), and how they are both bad in basically the same way.
Contrast this with the GPT-3 and GPT-4 releases: in those cases people had no trouble finding many, many examples of obviously distinctive behavior from the new model, and these were rapidly and profusely shared in the usual venues.
As Karpathy says, with GPT-4 it was “subtler” than it had been before, at least in some sense. But the difference was not that there weren’t any clear examples of better or different behavior – it was just that the cases where the new model behaved very differently tended to be obscure or tricky or otherwise “off the beaten path” somehow, so that if you weren’t actively looking for them, the user experience could feel deceptively similar to the one we had with earlier models.
But we were actively looking for those special cases, and we had no trouble finding them.
For instance, looking through my blog archives, I find this thread from shortly after the GPT-4 release, highlighting some puzzle-like questions that GPT-3.5 failed and GPT-4 aced. Summing up the trend, I wrote:
Subjectively, I’ve found that GPT-4 feels much more “attentive” and harder to trick than GPT-3.5.
When I’ve seen it make errors, they usually involve things on the edges of its knowledge – topics that are either academically advanced, or just not very widely known.
[...]
These cases are kind of tricky to discover.
On the one hand, GPT-4 does know a lot of stuff, including obscure stuff – this was the first obvious difference I noticed from GPT-3.5, and I later saw I wasn’t alone in that.
So you have to hunt for things obscure enough that it won’t know them. But if you start asking for really obscure stuff, it will often tell you (whether rightly or wrongly) that it doesn’t know the answer.
There’s still a “wedge” of cases where it will start confidently blabbing about something it doesn’t really understand, but the wedge has gotten much narrower.
Maybe the “wedge” was already so small before GPT-4.5 that it’s now simply very difficult to find anything that’s still a part of it?
But I dunno, that just doesn’t feel like the right explanation to me. For one thing, GPT-4.5 still gets a lot of (semi-)obscure-knowledge stuff wrong. (In one case I asked it about a piece of rationalist community trivia, and in the course of giving an inaccurate answer, it referred to “the Israeli blogger and activist Eliezer Yudkowsky”… like, come on, lmao.)
I’m open to the idea that this is no different from earlier scale-ups, mutatis mutandis – that it really is dramatically better in certain cases, like GPT-3 and 3.5 and 4 were, and those (perhaps obscure) cases simply haven’t diffused across the community yet.
But all of this “taste” stuff, all of this stuff where people post bog-standard AI slop and claim it has ineffably better vibes, just feels like an accidental admission of defeat re: the original question. It was never like that with previous scale-ups; we didn’t need “taste” then; in the cases that got highlighted, the difference was obvious.
(OTOH, if you look at two models that are differently scaled, but not “enough” – like just a 2x compute difference, say – typically it will be very hard to find unequivocal wins for the bigger model, with the latter winning at most in some vague aggregate vibes sense. One might then argue that this reflects something about the concave shape of the “log-compute vs. noticeable behavior” curve: 10x is the new 2x, and only with even more scale will we get something for which obvious wins are easy to evince.)
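(To make the concavity point concrete, here is a minimal numeric sketch. The curve below is an arbitrary concave function of log-compute chosen purely for illustration, not a fitted scaling law, and the compute levels are hypothetical.)

```python
# A minimal numeric sketch of the "10x is the new 2x" idea.
# Assumption (purely illustrative): "noticeable behavior" is some concave
# function of log-compute, here sqrt(log10(C)); this is not a fitted
# scaling law, and the compute levels are made up.
import math

def noticeable(compute_flops):
    # arbitrary concave-in-log-compute curve, for illustration only
    return math.sqrt(math.log10(compute_flops))

for base in (1e22, 1e24, 1e26):
    gain_2x = noticeable(2 * base) - noticeable(base)
    gain_10x = noticeable(10 * base) - noticeable(base)
    print(f"C = {base:.0e}: gain from 2x = {gain_2x:.3f}, gain from 10x = {gain_10x:.3f}")

# As the base compute grows, the delta from a fixed multiplier shrinks,
# so matching an older 2x jump eventually takes ~10x or more.
```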
I think most of the trouble is conflating recent models like GPT-4o with GPT-4, when they are instead ~GPT-4.25. It’s plausible that some already use 4x-5x the compute of the original GPT-4 (an H100 produces ~3x the compute of an A100), and that GPT-4.5 uses merely 3x-4x more compute than any of them. The distance between them and GPT-4.5 in raw compute might be quite small.
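(Taking those multipliers at face value, and they are guesses rather than published figures, the implied ratios work out roughly as follows.)

```python
# Back-of-the-envelope arithmetic using the multipliers guessed above;
# none of these are known figures, just the comment's assumptions.
gpt4 = 1.0            # original GPT-4 training compute, normalized to 1
gpt4o = gpt4 * 4.5    # "4x-5x the compute of original GPT-4" (midpoint)
gpt45 = gpt4o * 3.5   # "merely 3x-4x more compute than any of them" (midpoint)

print(gpt45 / gpt4)   # ~15.8x over original GPT-4: a full "version-step" gap
print(gpt45 / gpt4o)  # ~3.5x over GPT-4o: a gap small enough to look subtle
```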
It shouldn’t be at all difficult to find examples where GPT-4.5 is better than the actual original GPT-4 of March 2023; it’s not going to be subtle. Before ChatGPT there were very few well-known models at each scale, but now the gaps are all filled in by numerous models of intermediate capability. It’s the sorites paradox, not yet evidence of slowdown.
Is this actually the case? Not explicitly disagreeing, but I just want to point out that there is still a niche community that prefers using the oldest available 0314 gpt-4 checkpoint via API. That checkpoint is, by the way, still almost the same price as 4.5, hardware improvements notwithstanding, and it is pretty much the only way to still get access to a model that presumably makes use of the full ~1.8 trillion parameters the 4th-gen gpt was trained with.
Speaking of conflation, you see it everywhere in papers: somehow most people now entirely conflate gpt-4 with gpt-4 turbo, which replaced the full gpt-4 on chatgpt very quickly, and forget that there were many complaints back then that the faster (shrinking) model iterations were losing the “big model smell”, despite climbing the benchmarks.
And so when lots of people seem to describe 4.5’s advantages vs 4o as coming down to a “big model smell”, I think it is important to remember that 4-turbo and later 4o are clearly optimized for speed, price, and benchmarks far more than the original release gpt-4 was, and comparisons on taste/aesthetics/intangibles may be more fitting when using the original, non-goodharted, full-scale gpt-4 model. At the very least, it should fully and properly represent what a clean ~10x gap in training compute vs 4.5 looks like.
Hard disagree: this is evidence of slowdown. As the model updates grow more dense I also check out; a large jump in capabilities between the original gpt-4 and gpt-4.5 would remain salient to me. This is not salient.
My other comment was bearish, but in the bullish direction, I’m surprised Zvi didn’t include any of Gwern’s threads, like this or this, which, apropos of Karpathy’s blind test, I think have been the best clear examples of superior “taste” or quality from 4.5, and which actually swapped my preferences on 4.5 vs 4o when I looked closer.
As text prediction becomes ever more superhuman, I would actually expect improvements in many domains to become increasingly non-salient, as it takes ever-increasing thoughtfulness / language nuance to appreciate the gains.
But back to bearishness, it is unclear to me how much this mode-collapse improvement could just be dominated by post-training improvements instead of the pretraining scale-up. And of course, one has to wonder how superhuman text-prediction improvements will ever pragmatically alleviate the regime’s weaknesses in the many known economic and benchmarked domains, especially if Q-Star fails to generalize much at scale, just as multimodality failed to generalize much at scale before it.
We are currently scaling superhuman predictors of textual, visual, and audio datasets. The datasets themselves, primarily composed of the internet plus increasingly synthetically varied copies, are so generalized and varied that this prediction ability, by default, cannot escape including human-like problem solving and other agentic behaviors, as Janus helped model with simulacra some time ago. But as these predictors engorge themselves with increasingly opaque and superhuman heuristics toward that sole goal of predicting the next token, to expect that the intrinsically discovered methods will continue trending toward classically desired agentic and AGI-like behaviors seems naïve. The current convenient lack of a substantial gap between being good at predicting the internet and being good at figuring out a generalized problem will probably dissipate, and Goodhart will rear its nasty head as the ever-optimized-for objective diverges ever further from the actual AGI goal.