While working on another post, I decided to follow up on some details by doing some naive modeling of OpenAI’s LLM API revenue stream. The naive approach seems inadequate, because it implies OpenAI requires many years to break even just on the cost of GPUs.
OpenAI charges the following rates (from the OpenAI pricing page):
GPT-3.5 Turbo: input $0.001/1k tokens, output $0.002/1k tokens.
GPT-4 (non-Turbo): input $0.03/1k tokens, output $0.06/1k tokens.
How quickly do the GPTs generate tokens? The data comes from people running informal tests on Reddit, of all places, in the local LLaMA subreddit. The post is 4 months old, so they were testing 3.5 Turbo and 4 non-Turbo (4 Turbo launched earlier this month).
GPT-3.5 Turbo: ~100 tokens/s
GPT-4 (non-Turbo): ~12-13 tokens/s
This is the weakest part of the analysis: it’s just some people doing tests with a stopwatch. If you have a better source, please let me know.
With this data, we can calculate revenue for a single model running at 100% utilization.
Template: (<? tokens/s>) · (<? $/1k tokens>) · (31,557,600s/year) = $/year
GPT-3.5 Turbo: $6,311.52/year
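GPT-4 (non-Turbo): $18,934.56/year (this figure corresponds to a rounded-down 10 tokens/s; at 12.5 tokens/s it would be about $23,668/year)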
I’m unsure of how to model the input tokens.
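The template above can be written out as a short Python sketch. It counts output tokens only, as the post does, and assumes one model instance serving a single stream at 100% utilization. The function name is made up for illustration.

```python
SECONDS_PER_YEAR = 31_557_600  # 365.25 days, the figure used in the template


def annual_output_revenue(tokens_per_second, usd_per_1k_output_tokens):
    """Revenue per year for one stream generating output tokens nonstop."""
    tokens_per_year = tokens_per_second * SECONDS_PER_YEAR
    return tokens_per_year / 1000 * usd_per_1k_output_tokens


gpt35 = annual_output_revenue(100, 0.002)
# 10 tokens/s is the rate that reproduces the $18,934.56 figure in the post.
gpt4_rounded = annual_output_revenue(10, 0.06)
# The midpoint of the measured 12-13 tokens/s range gives a somewhat higher number.
gpt4_measured = annual_output_revenue(12.5, 0.06)

print(f"GPT-3.5 Turbo:        ${gpt35:,.2f}/year")
print(f"GPT-4 at 10 tokens/s: ${gpt4_rounded:,.2f}/year")
print(f"GPT-4 at 12.5 tokens/s: ${gpt4_measured:,.2f}/year")
```

Either GPT-4 rate leaves the conclusion unchanged: the per-stream revenue is small next to the hardware cost.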
How much does it cost to run a single model?
The last GPT model we have solid numbers for is GPT-3, the largest version of which has 175B parameters; Wikipedia claims it requires 800GB to store, which more or less fits the straightforward 32 bits/parameter · 175B parameters calculation.
800GB is an order of magnitude larger than the largest GPU memory size, so multiple GPUs are necessary to run a GPT-3 model.
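As a rough check on the memory arithmetic, here is a small Python sketch. It only counts the bytes needed to hold the weights, and the per-card memory sizes (80GB for a data center card, 12GB for a consumer card) are assumptions chosen for illustration. The helper names are hypothetical.

```python
import math


def weights_size_gb(n_params, bits_per_param):
    """Storage needed for the raw weights, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


def cards_needed(model_gb, card_memory_gb):
    """Minimum number of cards whose combined memory holds the model."""
    return math.ceil(model_gb / card_memory_gb)


raw_gb = weights_size_gb(175e9, 32)
print(raw_gb)  # 700.0, a bit under the 800GB Wikipedia figure

# Using the 800GB figure from the post:
print(cards_needed(800, 80))  # 10 cards at an assumed 80GB each
print(cards_needed(800, 12))  # 67 cards at an assumed 12GB each
```

This ignores activations, the KV cache and any other runtime overhead, so real deployments would need more headroom than this lower bound.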
People appear to be quite confident that the later models are even larger: there is a Manifold market that is 88% on GPT-4 having over 1 trillion parameters. I will use GPT-3’s numbers as a placeholder for now, since it is still illustrative.
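Which GPU might be used? Taking retail price points for recent high-end GPUs, and dividing 800GB by each card's memory (80GB for the H100 and A100, 12GB for the RTX 4070) to get the number of cards needed: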
Therefore the initial capital outlay to fully load the model across multiple GPUs is:
H100: 10 · $30k = $300k
A100: 10 · $18k = $180k
RTX 4070: 67 · $600 = $40.2k
Therefore breaking even on just the GPU capital outlay can take 2-48 years, depending on which chips are used for which pricing regime, GPT-3.5 or GPT-4. (2 years for getting GPT-4 rates from 67 RTX 4070s, and 48 years for getting GPT-3.5 rates from 10 H100s.)
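The break-even range can be reproduced with a few lines of Python. This is only the naive model from the post: capital outlay divided by single-stream output revenue, with no operating costs, no depreciation and no batching.

```python
# Capital outlay per GPU type, from the card counts and prices above.
capital_usd = {
    "H100": 10 * 30_000,
    "A100": 10 * 18_000,
    "RTX 4070": 67 * 600,
}

# Annual single-stream output revenue per pricing regime, from the post.
revenue_usd_per_year = {
    "GPT-3.5 Turbo": 6_311.52,
    "GPT-4": 18_934.56,
}

years = {
    (gpu, model): cost / revenue
    for gpu, cost in capital_usd.items()
    for model, revenue in revenue_usd_per_year.items()
}

for (gpu, model), y in sorted(years.items(), key=lambda kv: kv[1]):
    print(f"{gpu:9s} at {model:13s} rates: {y:5.1f} years")
```

The fastest combination (RTX 4070s at GPT-4 rates) comes out at about 2.1 years and the slowest (H100s at GPT-3.5 rates) at about 47.5 years, matching the 2-48 year range.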
However, the field of AI is moving quickly:
Nvidia plans to ramp up production for AI (August 2023, Reuters). Along with the recent spate of 3 new data center GPU architectures in the last 3 years (Wikipedia), it seems likely that Nvidia will continue producing new chip generations.
OpenAI is working on GPT-5 (November 2023, Tom’s Guide; original interview is behind the Financial Times paywall). Presumably the new model will use even more resources.
With new chips and new models quickly approaching, the lifetime of these current GPUs seems pretty short. Say it takes 8 years to recoup costs, but the GPU’s computing power becomes irrelevant within 4: half the cost of the GPU is then effectively lost.
OpenAI’s prices seem too low to recoup even part of their capital costs in a reasonable time given the volatile nature of the AI industry. Surely I’m missing something obvious?
Other Factors
Other factors I didn’t include in the model above, which may make cost/revenue increase/decrease:
COST+: the model not only has to pay for itself, it needs to pay for training costs/the data center/electricity/staff/buildings for staff/free tier queries.
COST+: these models aren’t going to be used at 100% efficiency; user numbers will ebb and flow over the course of a day, and GPUs are physical objects with failure rates.
COST+: the actual models are likely larger than GPT-3, so the GPU costs would be even larger.
REV+: the revenue from input tokens isn’t included; perhaps (pure speculation) this would push revenue higher by 2x?
REV+: perhaps the token generation rates we can see are misleading. If a model is time shared aggressively it may be serving 100 tokens/s to many users at once. This seems somewhat weird to me (why would you (for example) produce 10 tokens for user A, then user B, etc? Wouldn’t you need to constantly re-pay set up costs for each user? Would the gains in fairness/uniformity of response times really be worth it?), but maybe I’m overestimating how difficult this would be to engineer.
COST-: the 800GB model size comes from treating each parameter as a full 32 bit float, but the ML field has been trending towards less precision. If all the parameters are 16 bits, then we cut memory requirements (and therefore GPU costs) in half. 8 bit quantization is also possible, but my understanding is that a full 8 bit model can be unstable (“The main problem with using 8-bit precision is that transformers can get very unstable with so few bits”, buried in this post about DL GPUs), so it seems unlikely that the GPTs are able to fully cut their size to 1/4th.
COST-: the value of a GPU doesn’t depreciate to $0 as soon as a new GPT version/Nvidia architecture comes out. As a speculative example, smaller AI shops may be willing to snap up cheaper A100s in 4 years when OpenAI is no longer using them, recouping some costs. If the chip shortage is still ongoing the GPUs may even keep most of their value.
However, relying on this seems like a mistake; why would you eat this depreciation risk if you didn’t need to?
COST+: surely GPT-3+, the poster children of generative AI, are not scraping together lower end RTX 4070s to do inference? The cost/GB numbers are good, but I completely ignore basically every other performance metric you may want from a GPU.
REV+: perhaps the actual token generation rates aren’t so slow, and fixed costs like network transit time dominate and make token rates look much slower?
However, eyeballing the raw data from the Reddit thread it looks like if there is a fixed cost, it isn’t obvious, since generating 100 tokens and 700 tokens are both in the same 11-14 tokens/s range.
COST-: perhaps I’ve simply misunderstood how these large models are run. Perhaps, instead of running many GPUs in parallel, each model runs on only one GPU. The H100 model with the smallest GPU memory bandwidth can load its full 80GB memory about 25 times/second (2TB/s, see the H100 PCIe under Product Specifications), enough to theoretically pipe an entire 800GB model through the GPU in 0.4 seconds.
However, even if it works this seems pretty wasteful to me: why not instead run the GPUs in parallel and not pay the memory loading costs over and over? Wouldn’t that lead to much better latency and better hardware utilization?
I used retail pricing for everything, while OpenAI is likely sourcing their GPUs cheaper directly from the manufacturer.
However, I would also assume that OpenAI offers B2B pricing to large customers as well, so the savings in cost may well be balanced out by lower revenue.
OpenAI is still effectively a startup, so it might be fine simply losing lots of money.
However, OpenAI has a great market-leading position in a brand new field; why would they need to loss-lead like Uber or WeWork?
OpenAI is also a non-profit, but that doesn’t mean they should take a loss for no reason.
We would also need to reconcile the money pit hypothesis with “OpenAI on track to generate more than $1 bln revenue over 12 months” (August 2023, Reuters), which compared to “ChatGPT cost a fortune to make with OpenAI’s losses growing to $540 million last year, report says” (May 2023, Business Insider) (last year being 2022) is certainly suggestive. Definitely doesn’t mean OpenAI is profitable, but is pretty good evidence that they’re trying.
Yes: batching. Efficient GPU inference uses matrix-matrix multiplication, not vector-matrix multiplication.
+1 to Cannell’s answer, and I’ll also add pipelining.
Let’s say (one instance of) the system is distributed across 10 GPUs, arranged in series: to do a forward pass, the first GPU does some stuff, passes its result to the second GPU, which passes to the third, etc. If only one user at a time were being serviced, then 90% of those GPUs would be idle at any given time. But pipelining means that, once the first GPU in line has finished one request (or, realistically, batch of requests), it can immediately start on another batch of requests.
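The idle-GPU point can be made concrete with a toy Python model. It is an idealized sketch: every stage is assumed to take exactly one time step per batch, communication is free, and the function name and batch count are made up for illustration.

```python
def utilization(num_stages, num_batches, pipelined):
    """Fraction of stage-time spent computing, for equal-length stages."""
    work = num_stages * num_batches  # stage-steps of actual compute
    if pipelined:
        # A new batch enters every step; the last one exits after the fill time.
        elapsed_steps = num_stages + num_batches - 1
    else:
        # Each batch runs through all stages before the next one starts.
        elapsed_steps = num_stages * num_batches
    return work / (num_stages * elapsed_steps)


print(utilization(10, 100, pipelined=False))  # 0.1, i.e. 90% of GPUs idle
print(round(utilization(10, 100, pipelined=True), 3))  # 0.917
```

With many batches in flight the pipelined figure approaches 1, which is why the one-user-at-a-time picture understates what the hardware can serve.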
More generally: the rough estimate in the post above tries to estimate throughput from latency, which doesn’t really work. Parallelism/pipelining mean that latency isn’t a good way to measure throughput, unless we also know how many requests are processed in parallel at a time.
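To see how much the missing concurrency number matters, here is a sketch that redoes the revenue estimate with the number of simultaneous streams as an explicit parameter. The stream counts are purely hypothetical placeholders, not known figures for OpenAI's serving setup, and the sketch assumes per-stream speed does not drop as streams are added.

```python
SECONDS_PER_YEAR = 31_557_600


def annual_revenue(per_stream_tokens_per_s, concurrent_streams, usd_per_1k_tokens):
    """Output-token revenue per year when several streams are served at once."""
    throughput = per_stream_tokens_per_s * concurrent_streams  # tokens/s overall
    return throughput * SECONDS_PER_YEAR / 1000 * usd_per_1k_tokens


# Hypothetical concurrency levels, GPT-3.5 Turbo output pricing.
for streams in (1, 8, 64):
    revenue = annual_revenue(100, streams, 0.002)
    print(f"{streams:3d} concurrent streams: ${revenue:,.2f}/year")
```

Under these assumptions revenue scales linearly with concurrency, so an observed per-user token rate alone says little about revenue per instance.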
(Also I have been operating under the assumption that OpenAI is not profitable at-the-margin, and I’m curious to see an estimate.)
It seems very unlikely that they’re running their models at 32-bit precision. 8-bit seems more likely, or at most 16-bit. And yes, obviously batching and pipelining, and probably things comparable to all the attention-cost improvements that have been going on in the open-source side (if they didn’t invent them in parallel, they’ll certainly adopt them). Plus they mostly run Turbo models now: recent rumors about projects named Arrakis and Gobi plus the launch of GPT-4 Turbo suggest that making inference more efficient is very important to them.
Despite all that, I still wouldn’t be surprised if they were charging below cost, but I suspect they’re charging a price around where they think they can soon(ish) reduce inference costs to, between algorithmic improvements and Moore’s Law for GPUs.
Basically, they’re a start-up: they don’t need to be profitable yet, they need to persuade their investors that they have a credible plan for reaching profitability in the next few years.
I think they might be loss-leading to compete against the counterfactual of status-quo-bias, the not-using-a-model-at-all state of being. Once companies start to pay the cost to incorporate the LLMs into their workflows, I see no reason why OpenAI can’t just increase the price. I think this might happen by simply releasing a new improved model at a much higher price. If everyone is using and benefiting already from the old model, and the new one is clearly better, the higher price will be easier to justify as a good investment for businesses.
With basically a blank check from VC, they’ll instead invest in making their models and infra more efficient/better instead of raising prices. They can run a large loss for a very long time.
Why though? They have a capped profit model (theoretically) so there’s less value in this strategy, and their biggest investor would probably prefer that people use Bing instead.
General AI services are a natural monopoly. There is a large fixed cost to develop a competitive model, and lower marginal costs to deliver it.
The best* model will have the most paying customers. It’s a monopoly flywheel: the monopoly niche occupant reinvests in the most compute and the best engineers for model improvement, and the N+1 model is even more dominant, and so on.
There is a second network effect involved in hosting platforms for AI services. This can be an even stronger monopoly. Assuming the “app store” has some common copyrighted APIs for intercommunication between AI tools, it could make it impractical for companies offering models on the store to sell their wares anywhere else. This sends revenue to the monopoly platform owner even after they no longer offer the best model.
OpenAI seems to be pursuing both avenues like any for profit startup would. Their board has recently voted to lift the profit cap by 20 percent per year. ( https://www.economist.com/business/2023/11/21/inside-openais-weird-governance-structure )
*Refusing certain services, and refusing to offer long term guarantees, such as forever access to a frozen weight model, means OpenAI is leaving the door open to be evicted from this market niche.
The news is that the cap grows 20% a year, so it will really last until AGI.