NVIDIA Is A Terrible AI Bet
Short version: Nvidia’s only moat is in software; AMD already makes flatly superior hardware priced far lower, and Google probably does too but doesn’t publicly sell it. And if AI undergoes smooth takeoff on current trajectory, then ~all software moats will evaporate early.
Long version: Nvidia is pretty obviously in a hype-driven bubble right now. However, it is sometimes the case that (a) an asset is in a hype-driven bubble, and (b) it’s still a good long-run bet at the current price, because the company will in fact be worth that much. Think Amazon during the dot-com bubble. I’ve heard people make that argument about Nvidia lately, on the basis that it will be ridiculously valuable if AI undergoes smooth takeoff on the current apparent trajectory.
My core claim here is that Nvidia will not actually be worth much, compared to other companies, if AI undergoes smooth takeoff on the current apparent trajectory.
Other companies already make ML hardware flatly superior to Nvidia’s (in flops, memory, whatever), and priced much lower. AMD’s MI300x is the most obvious direct comparison. Google’s TPUs are probably another example, though they’re not sold publicly so harder to know for sure.
So why is Nvidia still the market leader? No secret there: it’s the CUDA libraries. Lots of (third-party) software is built on top of CUDA, and if you use non-Nvidia hardware then you can’t use any of that software.
That’s exactly the sort of moat which will disappear rapidly if AI automates most-or-all software engineering, and on current trajectory software engineering would be one of the earlier areas to see massive AI acceleration. In that world, it will be easy to move any application-level program to run on any lower-level stack, just by asking an LLM to port it over.
So in worlds where AI automates software engineering to a very large extent, Nvidia’s moat is gone, and their competition has an already-better product at already-lower price.
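To make the porting claim a bit more concrete: most application-level ML code already barely touches the vendor stack. A minimal sketch, assuming a ROCm build of PyTorch on the AMD side (which exposes AMD GPUs through the same torch.cuda API), so the same application code runs on either vendor once the lower-level libraries exist:

```python
# Minimal sketch: the same application-level PyTorch code runs on an Nvidia GPU
# (CUDA build) or an AMD GPU (ROCm build, which surfaces HIP through the
# torch.cuda namespace). The hard part of a "port" lives below this layer, in
# the CUDA-only kernels and libraries that third-party software depends on.
import torch

def pick_device() -> torch.device:
    # On a ROCm build, torch.cuda.is_available() also returns True for AMD GPUs.
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)  # identical code path regardless of which vendor's stack is underneath
print(y.shape, device)
```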
Why do you believe AMD and Google make better hardware than Nvidia?
The easiest answer is to look at the specs. Of course specs are not super reliable, so take it all with many grains of salt. I’ll go through the AMD/Nvidia comparison here, because it’s a comparison I looked into a few months back.
MI300x vs H100
Techpowerup is a third-party site with specs for the MI300x and the H100, so we can do a pretty direct comparison between those two pages. (I don’t know if the site independently tested the two chips, but they’re at least trying to report comparable numbers.) The H200 would arguably be more of a “fair comparison” since the MI300x came out much later than the H100; we’ll get to that comparison next. I’m starting with the MI300x vs H100 comparison because techpowerup has specs for both of them, so we don’t have to rely on either company’s bullshit-heavy marketing materials as a source of information. Also, even the H100 is priced 2-4x higher than the MI300x (~$30-45k vs ~$10-15k), so it’s not unfair to compare the two.
Key numbers (MI300x vs H100):
float32 TFLOPs: ~80 vs ~50
float16 TFLOPs: ~650 vs ~200
memory: 192 GB vs 80 GB (note that this is the main place where the H200 improves on the H100)
bandwidth: ~10 TB/s vs ~2 TB/s
… so the comparison isn’t even remotely close. The H100 is priced 2-4x higher but is utterly inferior in terms of hardware.
MI300x vs H200
I don’t know of a good third-party spec sheet for the H200, so we’ll rely on Nvidia’s page. Note that they report some numbers “with sparsity” which, to make a long story short, means those numbers are blatant marketing bullshit. Other than those numbers, I’ll take their claimed specs at face value.
Key numbers (MI300x vs H200):
float32 TFLOPs: ~80 vs ~70
float16 TFLOPs: don’t know, Nvidia conspicuously avoided reporting that number
memory: 192 GB vs 141 GB
bandwidth: ~10 TB/s vs ~5 TB/s
So they’re closer than the MI300x vs H100, but the MI300x still wins across the board. And pricewise, the H200 is probably around $40k, so 3-4x more expensive than the MI300x.
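Spelling out the price-performance arithmetic implied by the numbers above, here is a rough sketch using the midpoints of the price ranges quoted in this thread (these are the thread’s estimates, not official list prices; “n/a” where the spec wasn’t reported):

```python
# Rough performance-per-dollar comparison from the specs and street-price
# estimates quoted above (illustrative midpoints only, not vendor list prices).
chips = {
    #          fp32 TFLOPs, fp16 TFLOPs, memory GB, bandwidth TB/s, est. price $
    "MI300x": (80,          650,         192,       10,             12_500),
    "H100":   (50,          200,          80,        2,             37_500),
    "H200":   (70,          None,        141,        5,             40_000),  # fp16 not reported
}

for name, (fp32, fp16, mem, bw, price) in chips.items():
    per_k = price / 1000  # price in thousands of dollars
    fp16_str = f"{fp16 / per_k:.1f}" if fp16 is not None else "n/a"
    print(f"{name}: {fp32 / per_k:.1f} fp32 TFLOPs/$k, {fp16_str} fp16 TFLOPs/$k, "
          f"{mem / per_k:.1f} GB/$k, {bw / per_k:.2f} TB/s/$k")
```

On these numbers the MI300x comes out several-fold ahead on every per-dollar metric, which is the point of the comparison above.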
It’s worth noting that even if Nvidia is charging 2-4x more now, the ultimate question for competitiveness will be manufacturing cost for Nvidia vs AMD. If Nvidia has much lower manufacturing costs than AMD per unit of performance (but presumably a higher markup), then Nvidia might win out even if their product is currently worse per dollar.
Note also that price discrimination might be a big part of Nvidia’s approach. Scaling labs willing to go to great effort to drop compute costs by a factor of two are exactly the subset of Nvidia’s customers to whom Nvidia would ideally prefer to offer lower prices. I expect Nvidia will find a way to make this happen.
I’m holding a modest long position in NVIDIA (smaller than my position in Google), and expect to keep it for at least a few more months. I expect I only need NVIDIA margins to hold up for another 3 or 4 years for it to be a good investment now.
It will likely become a bubble before too long, but it doesn’t feel like one yet.
No, the MI300x is not superior to Nvidia’s chips, largely because it costs >2x as much to manufacture as Nvidia’s chips.
While the first-order analysis seems true to me, there are mitigating factors:
AMD appears to be bungling the job of making their GPUs reliable and fast, and probably will for another few years. (At least, this is my takeaway from following the TinyGrad saga on Twitter...) Their stock is not valued as it should be for a serious contender with good fundamentals, and I think this may stay the case for a while, if not forever if things are worse than I realize.
NVIDIA will probably have very-in-demand chips for at least another chip generation due to various inertias.
There aren’t many good-looking places for the large amount of money that wants to be long AI to go right now, and this will probably keep prices inflated across the board for a while yet, in proportion to how relevant-seeming the stock is. NVDA rates very highly on this one.
So from my viewpoint I would caution against being short NVIDIA, at least in the short term.
Potential counterpoints:
If AI automates most, but not all, software engineering, moats of software dependencies could get more entrenched, because easier-to-use libraries have compounding first-mover advantages.
The disadvantages of AMD software development potentially need to be addressed at levels not accessible to an arbitrary feral automated software engineer in the wild, to make the stack sufficiently usable. (A lot of actual human software engineers would like the chance.)
NVIDIA is training their own AIs, who are pretty capable.
NVIDIA can invest their current profits. (Revenues, not stock valuations.)
I don’t think the advantages would necessarily compound—quite the opposite, there are diminishing returns and I expect ‘catchup’. The first-mover advantage neutralizes itself because a rising tide lifts all boats, and the additional data acts as a prior: you can define the advantage of a better model, due to any scaling factor, as equivalent to n additional datapoints. (See the finetuning transfer papers on this.) When a LLM can zero-shot a problem, that is conceptually equivalent to a dumber LLM which needs 3-shots, say. And so the advantages of a better model will plateau, and can be matched by simply some more data in-context—such as additional synthetic datapoints generated by self-play or inner-monologue etc. And the better the model gets, the more ‘data’ it can ‘transfer’ to a similar language to reach a given X% of coding performance. (Think about how you could easily transfer given access to an environment: just do self-play on translating any solved Python problem into the target language. You already, by stipulation, have an ‘oracle’ to check outputs of the target against, which can produce counterexamples.) To a sad degree, pretty much all programming languages are the same these days: ALGOL with C sugaring to various degrees and random ad hoc addons; a LLM which can master Python can master Javascript can master Typescript… The hard part is the non-programming-language parts, the algorithms and reasoning and being able to understand & model the implicit state updates—not memorizing the standard library of some obscure language.
So at some point, even if you have a model which is god-like at Python (at which point each additional Python datapoint adds basically next to nothing), you will find it is completely acceptable at JavaScript, say, or even your brand-new language with 5 examples which you already have on hand in the documentation. You don’t need ‘the best possible performance’, you just need some level of performance adequate to achieve your goal. If the Python is 99.99% on some benchmark, you are probably fine with 99.90% performance in your favorite language. (Presumably there is some absolute level like 99% at which point automated CUDA → ROCm becomes possible, and it is independent of whether some other language has even higher accuracy.) All you need is some minor reason to pay that slight non-Python tax. And that’s not hard to find.
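A toy illustration of the catch-up dynamic described above; the power-law form and every constant here are made up purely for illustration, treating a stronger base model as equivalent to some number of extra ‘effective’ datapoints in the target language:

```python
# Toy model of the "better model ≈ n extra datapoints" equivalence: error in the
# target language follows an assumed power law in effective data, where effective
# data = (data effectively transferred from the base model) + (extra target-language
# examples). All numbers are invented for illustration, not fit to anything real.
def error_rate(effective_datapoints: float, k: float = 1.0, alpha: float = 0.5) -> float:
    return k / (effective_datapoints ** alpha)

strong_transfer = 1_000_000  # a model "god-like at Python" transfers a lot
weak_transfer = 10_000       # a weaker model transfers less

for extra in [0, 1_000, 100_000, 1_000_000]:
    strong = error_rate(strong_transfer + extra)
    weak = error_rate(weak_transfer + extra)
    print(f"{extra:>9} extra target-language examples: "
          f"strong err {strong:.3%}, weak err {weak:.3%}, gap {weak - strong:.3%}")
# The gap shrinks toward zero as extra examples accumulate: both models end up
# "good enough" in the target language, which is the plateau argument above.
```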
Also, I suspect that the task of converting CUDA code to ROCm code might well fall into the ‘most’ category rather than being among the holdout programming tasks. This is a category of code ripe for automation: you have, again by stipulation, correct working code which can be imitated and used as an oracle autonomously to brute-force translation; the code usually consists of very narrow, specific algorithmic tasks (‘multiply this matrix by that matrix to get this third matrix; every number should be identical’); random test-cases are easy to generate (just big grids of numbers); and the non-algorithmic parts have simple end-to-end metrics (‘loss go down per wallclock second’) to optimize. Compared to a lot of areas, like business logic or GUIs, this seems much more amenable to tasking LLMs with. geohot may lack the followthrough to make AMD GPUs work, and plow through papercut after papercut, but there would be no such problem for a LLM.
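To make the ‘working code as an oracle’ point concrete, here is a minimal sketch of the kind of harness an LLM could be looped against when translating a kernel. reference_matmul and candidate_matmul are hypothetical stand-ins for the existing CUDA path and the freshly generated ROCm/HIP path; both are stubbed with NumPy here so the sketch runs as-is:

```python
import numpy as np

def reference_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Stand-in for the existing, known-good CUDA kernel (the oracle).
    return a @ b

def candidate_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Stand-in for the freshly translated ROCm/HIP kernel under test.
    return a @ b

def fuzz_translation(trials: int = 100, atol: float = 1e-5) -> bool:
    rng = np.random.default_rng(0)
    for _ in range(trials):
        m, k, n = rng.integers(1, 257, size=3)  # random shapes: "big grids of numbers"
        a = rng.standard_normal((m, k)).astype(np.float32)
        b = rng.standard_normal((k, n)).astype(np.float32)
        if not np.allclose(reference_matmul(a, b), candidate_matmul(a, b), atol=atol):
            # A failing (a, b) pair is exactly the counterexample to feed back
            # to the model before regenerating the translated kernel.
            return False
    return True

print("candidate matches oracle on random tests:", fuzz_translation())
```

Wall-clock speed can then be checked the same way, by timing both paths on the same inputs.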
So I agree with Wentworth that there seems to be a bit of a tricky transition here for Nvidia: it has never been worth the time & hassle to try to use an AMD GPU (although a few claim to have made it work out financially for them), because of the skilled labor and wallclock and residual technical risk and loss of flexibility & ecosystem; but if LLM coding works out well enough and intelligence becomes ‘too cheap to meter’, almost all of that goes away. Even ordinary unsophisticated GPU buyers will be able to tell their LLM to ‘just make it work on my new GPU, OK? I don’t care about the details, just let me know when you’re done’. At this point, what is the value-add for Nvidia? If they cut down their fat margins and race to the bottom for the hardware, where do they go for the profits? The money all seems to be in the integration and services—none of which Nvidia is particularly good at. (They aren’t even all that good at training LLMs! The Megatron series was a disappointment, Megatron-NLG-530b is barely a footnote, and even the latest Nemo seems to barely match Llama-3-70b while being like 4x larger and thus more expensive to run.)
And this will be true of anyone who is relying on software lockin: if the lockin is because it would take a lot of software-engineer time to do a reverse-engineering rewrite and replacement, then it’s in serious danger in a world where LLMs code at human level. In a world where you can hypothetically spin up a thousand SWEs on a cloud service, tell them, ‘write me an operating system like XYZ’, and they do so overnight while you sleep, durable software moats are going to require some sort of mysterious blackbox like a magic API; anything which is so modularized as to fit on your own computer is also sufficiently modularized as to easily clone & replace...
It’s probably worth mentioning that there’s now a licensing barrier to running CUDA specifically through translation layers: https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers
This isn’t purely a software-engineering-time lockin; some of that money is going to go to legal action looking for any hint that big targets have done the license-noncompliant thing.
Edit: Additionally, I don’t think a world where “most but not all” software engineering is automated is one where it will be a simple matter to spin up a thousand effective SWEs of that capability; I think there’s first a world where that’s still relatively expensive even if most software engineering is being done by automated systems. Paying $8000 for overnight service of 1000 software engineers would be a rather fine deal, currently, but still too much for most people.
I don’t think that will be at all important. You are creating alternate reimplementations of the CUDA API, you aren’t ‘translating’ or decompiling it. And if you are buying billions of dollars of GPUs, you can afford to fend off some Nvidia probes and definitely can pay $0.000008b periodically for an overnighter. (Indeed, Nvidia needing to resort to such Oracle-like tactics is a bear sign.)
While there’s truth in what you say, I also think a market that’s running thousands of software engineers is likely to be hungry for as many good GPUs as the current manufacturers can make. NVIDIA not being able to sustain a relative monopoly forever still doesn’t put it in a bad position.
People will hunger for all the GPUs they can get, but then that means that the favored alternative GPU ‘manufacturer’ simply buys out the fab capacity and does so. Nvidia has no hardware moat: they do not own any chip fabs, they don’t own any wafer manufacturers, etc. All they do is design and write software and all the softer human-ish bits. They are not ‘the current manufacturer’ - that’s everyone else, like TSMC or the OEMs. Those are the guys who actually manufacture things, and they have no particular loyalty to Nvidia. If AMD goes to TSMC and asks for a billion GPU chips, TSMC will be thrilled to sell the fab capacity to AMD rather than Nvidia, no matter how angry Jensen is.
So in a scenario like mine, if everyone simply rewrites for AMD, AMD raises its prices a bit and buys out all of the chip fab capacity from TSMC/Intel/Samsung/etc—possibly even, in the most extreme case, buying capacity from Nvidia itself, as it suddenly finds itself unable to sell anything at the high prices it may be trying to defend, and is forced to resell its reserved chip fab capacity in the resulting liquidity crunch. (No point in spending chip fab capacity on chips you can’t sell at your target price when you aren’t sure what you’re going to do with them.) And if AMD doesn’t do so, then player #3 does so, and everyone rewrites again (which will be easier the second time as they will now have extensive test suites, two different implementations to check correctness against, documentation from the previous time, and AIs which have been further trained on the first wave of work).
But why would the profit go to NVIDIA, rather than TSMC? The money should go to the company with the scarce factor of production.
(… lol. That snuck in without any conscious intent to imply anything, yes. I haven’t even personally interacted with the open Nvidia models yet.)
I do think the analysis is a decent map to nibbling at NVIDIA’s pie share if you happen to be a competitor already—AMD, Intel, or Apple currently, to my knowledge, possibly Google depending what they’re building internally and if they decide to market it more. Apple’s machine learning ecosystem is a bit of a parallel one, but I’d be at least mildly interested in it from a development perspective, and it is making progress.
But when it comes to the hardware, this is a sector where it’s reasonably challenging to conjure a competitor out of thin air still, so competitor behavior—with all its idiosyncrasies—is pretty relevant.
Two questions on this.
First, if AI is a big value driver in a general economic sense, is your view that NVIDIA is overpriced relative to its future potential, or just that NVIDIA will relatively underperform other investment alternatives you see?
Second, and perhaps an odd and speculative (perhaps nonsense) thought: I would expect that in this area one might see some network effects in play as well, so I wonder if that might impact the AI engineering decisions on software. Could the AI software solutions look towards maximising the value of the installed network (AIs work better on a common chip and code infrastructure) rather than what isolated technical stats would suggest? A bit along the lines of why Beta was displaced by VHS despite being the better technology. If so, then it seems possible that NVIDIA could remain a leader and enjoy its current pricing power (at least to some extent) for a fairly long period of time.
AI that can rewrite CUDA is a ways off. It’s possible that it won’t be that far away in calendar time, but it is far away in terms of AI market growth and hype cycles. If GPT-5 does well, Nvidia will reap the gains more than AMD or Google.
Transpiling assembly code written for one OS/kernel to assembly code for another OS/kernel, while taking advantage of the full speed of the processor, is a completely different task from transpiling, say, Java code into Python.
Also, the hardware/software abstraction might break. A Python developer can say hardware failures are not my problem. An assembly developer working at an AGI lab needs to consider hardware failures as lost wallclock time in their company’s race to AGI, and will try to write code so that hardware failures don’t cause the company to lose time.
GPT-4 definitely can’t do this type of work, and I’ll bet a lot of money GPT-5 can’t do it either. ASI can do it, but there are bigger considerations there than whether Nvidia makes money, such as whether we’re still alive and whether markets and democracy continue to exist. Making a guess of N for which GPT-N can get this done requires evaluating how hard of a software task this actually is, and your comment contains no discussion of this.
Have you looked at tinygrad’s codebase or spoken to George Hotz about this?
Shorting Nvidia might be tricky. I’d short Nvidia and go long TSM or an index fund to be safe, at some point. Maybe now? Typically the highest-market-cap stock has poor performance after it claims that spot.