My other comment was bearish, but in the bullish direction, I’m surprised Zvi didn’t include any of Gwern’s threads, like this or this, which, apropos of Karpathy’s blind test, I think have been the clearest examples of superior “taste” or quality from 4.5, and which actually swapped my preference between 4.5 and 4o when I looked closer.
As text prediction becomes ever more superhuman, I would actually expect improvements in many domains to become increasingly non-salient, as it takes ever-increasing thoughtfulness and sensitivity to language nuance to appreciate the gains.
But back to bearishness: it is unclear to me how much of this mode-collapse improvement could just be driven by post-training improvements instead of the pretraining scale-up. And of course, I wonder how superhuman text-prediction improvement will ever pragmatically alleviate the regime’s weaknesses in the many known economic and benchmarked domains, especially if Q-Star fails to generalize much at scale, just as multimodality failed to generalize much at scale before it.
We are currently scaling superhuman predictors of textual, visual, and audio datasets. The datasets themselves, primarily composed of the internet plus increasingly synthetically varied copies, are so general and varied that this prediction ability, by default, cannot escape including human-like problem solving and other agentic behaviors, as Janus helped model with Simulacra some time ago. But as these predictors engorge themselves with increasingly opaque and superhuman heuristics towards the sole goal of predicting the next token, it seems naïve to expect that the methods they discover will continue trending towards classically desired agentic and AGI-like behaviors. The current convenient lack of a substantial gap between being good at predicting the internet and being good at figuring out a generalized problem will probably dissipate, and Goodhart will rear its nasty head as the ever-optimized-for objective diverges ever further from the actual AGI goal.