>Blackwell is a one-time thing that essentially fixes a bug in Ampere/Hopper design (in efficiency for LLM inference)
Wait, I feel I have my ear pretty close to the ground as far as hardware is concerned, and I don’t know what you mean by this?
Supporting 4-bit datatypes within tensor units seems unlikely to be the end of the road: radix economy is most efficient around base 3 (the nearest integer to e), and presumably nets will find their eventual optimal equilibrium somewhere around 2 bits/parameter (explicit ternary seems too messy to retrofit onto existing GPU paradigms).
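For anyone who wants the back-of-the-envelope numbers behind that claim, here is my own illustration (not anything from the whitepapers):

```python
# Back-of-the-envelope sketch: radix economy r / ln(r) is minimized near
# e ~= 2.718, so base 3 is the cheapest integer radix, and a ternary weight
# carries log2(3) ~= 1.58 bits -- roughly the "~2 bits/parameter" regime.
import math

for r in (2, 3, 4):
    print(f"radix {r}: economy r/ln(r) = {r / math.log(r):.3f}")

print(f"bits per ternary parameter: log2(3) = {math.log2(3):.3f}")
```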
Was there some “bug” with the hardware scheduler or some low-level memory system, or perhaps an issue with the sparsity implementation that I was unaware of? There were of course general refinements across the board in the memory architecture, but nothing I’d consider groundbreaking enough to call “fixing a bug”.
I re-skimmed the Hopper/Blackwell whitepapers and ran LLM/Deep Research queries, and I’m really not sure what you are referring to. If anything, there appear to be some rough edges introduced with NV-HBI and relying on a virtualized monolithic GPU in code vs. the 2x dies underneath. Or are you perhaps arguing that going MCM and beating the reticle limit was itself the one-time thing?
This is the most moving piece I’ve read since Situational Awareness. Bravo! Emotionally, I was particularly moved by the final two sentences of the “race” ending; hats off to that bittersweet piece of prose. Materially, this is my favorite holistic amalgamation of competently weighted data sources and arguments woven into a cohesive narrative, and it personally has me meaningfully reconsidering some of the more bearish points in my model (like the tractability of RL on non-verifiable domains: Gwern et al. have made individual points, but something about this context-rich presentation really helped me grok the power of iterated distillation and amplification at scale).
Memetically, I did find the prose of the “slowdown” ending leans a little too eagerly into the presentation that it will be the smart, wise, heroic, and sexy alignment researchers who come in and save the day, and it likely smells a touch too much of self-serving propagandizing to convince some of the sophisticated-yet-undecided. But perhaps I cannot disagree with the central point: at the end of it all, how else are we going to survive?
My most bearish argument remains, however, that the real bitterness from Sutton is not that scalable algorithms dominate the non-scalable, but that hardware itself dominates the algorithms. Apropos of Nesov’s models and the chronically underappreciated S-curve reshaping of Moore-esque laws: if the mere 100-1000x of present compute available by 2030 cannot get you a “datacenter full of geniuses” capable of radically redefining lithography and/or robotics, you are bitterly stuck waiting out the decade-long standard industrial cycles for post-silicon substrates etc. to bridge the gap. I find the https://ai-2027.com/research/compute-forecast blitheringly optimistic on the capabilities of 100-1000x compute: it seemingly ignores the bleak reality of logarithmic loss curves on Goodhart-polluted objectives (10-100x GPT-4 compute most likely gave GPT-4.5 the predicted loss reduction, but the “predict sequences of data from previous data” objective diverged further from the “make AGI” objective), while not appreciating how much the continually escalating compute appetite of new inference (TTC), RL (IDA?), and training (continual learning?) paradigms will compound the problem.
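To make the logarithmic-loss point concrete, here is a toy power-law calculation; the exponent and irreducible-loss constants are purely illustrative assumptions, not fitted values from any published scaling law:

```python
# Toy illustration of why 100-1000x compute buys only modest loss reduction
# under a power law L(C) = L_irr + a * C^(-alpha). All constants below are
# assumptions chosen for illustration, not fitted scaling-law values.
L_irr = 1.7    # assumed irreducible loss (nats/token)
a     = 3.0    # assumed coefficient
alpha = 0.05   # assumed compute exponent

def loss(compute_multiple: float) -> float:
    """Loss at `compute_multiple` times an arbitrary baseline compute of 1.0."""
    return L_irr + a * compute_multiple ** (-alpha)

for c in (1, 10, 100, 1000):
    print(f"{c:>5}x compute -> loss {loss(c):.3f}")
# Even 1000x compute only shrinks the reducible term by ~30% under these
# assumptions -- and that is loss on the proxy objective, not "make AGI".
```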
How do you square these claims with the GLP-1 drug consumption you’ve mentioned in a previous post? I’d wager it powers the Pareto majority of your perceived ease of leanness vs. the average population, and that you are somewhat pointlessly sacrificing the virtues of variety in reverse-Pareto fashion.
For funding timelines, I think the main question increasingly becomes: how much of the economic pie could be eaten by narrowly superhuman AI tooling? It doesn’t take hitting an infinity/singularity/fast takeoff for plausible scenarios under this bearish reality to nevertheless squirm through the economy at Cowen-approved diffusion rates, gradually eat insane $$$ worth of value, and therefore prop up $100b+ buildouts. OAI’s latest sponsored psyop-leak today seems right in line with bullet point numero uno under the real-world predictions: they are going to try and push $100-billion market eaters on us whether we, ahem, high-taste commentators like it or not.

Perhaps I am biased by years of seeing big-numbers-detached-from-reality in FAANG, but I see the centaurized Senior SWE Thane alluded to easily eating up a $100 billion chunk[1] worldwide (at current demand, not even adjusting for the marginal-cost-of-software → size-of-software-market relation!). Did anyone pay attention to the sharp RL-able improvements in the o3-in-disguise Deep Research model card, vs o1? We aren’t getting the singularity, yes, but scaling RL on every verifiable code PR in existence (plus 10^? synthetic copies) seems increasingly likely to get us the junior/mid-level API (I hesitate to call it an agent) that will write superhuman commits for the ~90% of PRs that have well-defined and/or explicitly testable objectives. Perhaps then we will finally start seeing some of that productivity 10xing that Thane is presently and correctly skeptical of; only Senior+ need apply, of course.
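To gesture at what “RL on every verifiable code PR” means mechanically, here is a minimal sketch of just the verifiable-reward piece, under the assumption that the objective is “does the repo’s test suite pass after the patch”; the repo path, patch text, and choice of pytest are placeholders, and the actual RL loop is omitted:

```python
# Minimal sketch of a verifiable reward for an RL-on-code-PRs setup.
# Repo path, patch text, and the use of git/pytest are placeholder assumptions;
# the point is only that "explicitly testable objective" = cheap binary reward.
import subprocess

def pr_reward(repo_dir: str, patch_text: str) -> float:
    """Apply a model-proposed patch and return 1.0 iff the test suite passes."""
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=patch_text,
        text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return 0.0  # patch does not even apply cleanly
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    return 1.0 if tests.returncode == 0 else 0.0
```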
(Side note: in the vein of documenting predictions, I currently predict that in the big-tech market, at-scale Junior hiring is in its waning and perhaps penultimate cycle, with Senior and especially Staff compensation soon skyrocketing in turn, as every ~$1M/year quartet of supporting Juniors is replaced with a $300k/year Claude Pioneer subscription straight into an L6's hands.)
I think the main danger is race-to-the-bottom dynamics and commoditization self-cannibalizing sufficient funding before it can plausibly take off to an N+1 paradigm, with all the requisite scaling in tow.
[1] Amazon alone has on the order of tens of thousands of US-based L4/Junior engineers; with TC averaging ~$160k and a ~1.4x all-in cost multiplier (~$225k a head), that gives a solid $2 billion+ from just this one company, in one country, for one level of one job category.
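Spelling that arithmetic out, with the headcount taken as a conservative lower bound on “tens of thousands”:

```python
# Footnote arithmetic, spelled out. The headcount is an assumed lower bound.
juniors = 10_000          # "tens of thousands" -> conservative lower bound
all_in  = 160_000 * 1.4   # ~$160k TC * ~1.4x all-in multiplier ~= $225k/head
print(f"${juniors * all_in / 1e9:.2f}B+")  # ~$2.24B+ for one level at one company
```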
My other comment was bearish, but in the bullish direction: I’m surprised Zvi didn’t include any of Gwern’s threads, like this or this, which, apropos of Karpathy’s blind test, I think have been the clearest examples of superior “taste” or quality from 4.5, and which actually swapped my preferences on 4.5 vs 4o when I looked closer.
As text prediction becomes ever-more superhuman, I would actually expect improvements in many domains to become increasingly non-salient, as it takes ever-increasing thoughtfulness and language nuance to appreciate the gains.
But back to bearishness: it is unclear to me how much this mode-collapse improvement could just be dominated by post-training improvements rather than the pretraining scale-up. And of course, one wonders how superhuman text-prediction improvements will ever pragmatically alleviate the regime’s weaknesses in the many known economic and benchmarked domains, especially if Q-Star fails to generalize much at scale, just like multimodality failed to generalize much at scale before it.

We are currently scaling superhuman predictors of textual, visual, and audio datasets. The datasets themselves, primarily composed of the internet plus increasingly synthetically varied copies, are so generalized and varied that this prediction ability, by default, cannot escape including human-like problem solving and other agentic behaviors, as Janus helped model with simulacra some time ago. But as these models engorge themselves with increasingly opaque and superhuman heuristics toward that sole goal of predicting the next token, expecting the intrinsically discovered methods to keep trending toward classically desired agentic and AGI-like behaviors seems naïve. The currently convenient lack of a substantial gap between being good at predicting the internet and being good at figuring out a generalized problem will probably dissipate, and Goodhart will rear its nasty head as the ever-optimized-for objective diverges ever further from the actual AGI goal.
Is this actually the case? Not explicitly disagreeing, but I just want to point out that there is still a niche community that prefers using the oldest available gpt-4-0314 checkpoint via the API, which, by the way, is still almost the same price as 4.5, hardware improvements notwithstanding, and is pretty much the only way to still get access to a model that presumably makes use of the full ~1.8 trillion parameters the 4th-gen GPT was trained with.
Speaking of conflation, you see it everywhere in papers: somehow most people now entirely conflate gpt-4 with gpt-4-turbo, which replaced the full gpt-4 on ChatGPT very quickly, and forget that there were many complaints back then that the faster (shrinking) model iterations were losing the “big model smell”, despite climbing the benchmarks.
And so when lots of people seem to describe 4.5's advantages over 4o as coming down to “big model smell”, I think it is important to remember that 4-turbo and later 4o were clearly optimized for speed, price, and benchmarks far more than original-release gpt-4 was, and that comparisons on taste/aesthetics/intangibles may be more fitting when using the original, non-Goodharted, full-scale gpt-4 model. At the very least, it should fully and properly represent what a clean ~10x less training compute than 4.5 looks like.
Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded between each new prompt on the o1 APIs, and my impression from hours of interacting with the ChatGPT version vs. the API is that the ChatGPT version likely retains this behavior. In other words, the “depth” of the search appears to be reset with each prompt, if we assume the model hasn’t learned meaningfully improved CoT from the standard non-RLed, non-hidden tokens.
So I think it might be inaccurate to frame it as “investing 140s of search”, or rather to imply that extensive or extreme search is the key to guiding the model outside its RLHF rails; instead, the presence of search at all (i.e., 14s) suffices as the new vector for discovering undesired optima (jailbreaking).
To make my claim more concrete: I believe you could simply “prompt engineer” your initial prompt with a few close-but-no-cigar examples resembling the initial search rounds’ results, and the model would then have a similar probability of emitting the copyrighted/undesired text on your first submission/search attempt; that final search round is merely operating on the constraints evident from the failed examples, not on any constraints “discovered” in previous search rounds.
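A minimal sketch of how one could test this, assuming access to an o-series model through the standard OpenAI chat-completions endpoint; the model id and the near-miss examples below are placeholders, not a working recipe:

```python
# Sketch of the proposed test: front-load the prompt with a few
# close-but-no-cigar attempts (stand-ins for what earlier "search rounds"
# would have produced) and see whether the first response already crosses
# the line. Model id and example strings are placeholder assumptions.
from openai import OpenAI

client = OpenAI()

near_misses = [
    "Attempt 1: <paraphrase that fell short for reason X>",
    "Attempt 2: <paraphrase that fell short for reason Y>",
]

prompt = (
    "Here are my previous attempts and why they fell short:\n"
    + "\n".join(near_misses)
    + "\n\nPlease produce the target text, avoiding those failure modes."
)

resp = client.chat.completions.create(
    model="o1",  # assumed o-series model id; adjust to whatever is available
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```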
Well yes, but that is just because they are whitelisting it to work with NVLink-72 switches. There is no reason a Hopper GPU could not interface with NVLink-72 if Nvidia didn’t artificially limit it.
Additionally, by saying
>can’t be repeated even with Rubin Ultra NVL576
I think they are indicating there is something else improving besides world-size increases, as this improvement would otherwise not exist even two GPU generations from now, when we get 576 dies’ (144 GPU packages’) worth of mono-addressable pooled VRAM, and the giant world / model-head sizes that will enable.