Note that these numbers are much higher than the approximately $60 million[2] it would cost to rent all the hardware required for the final training run of GPT-4, if one were willing to commit to renting the hardware for much longer than the training run itself, as is likely common for large AI labs. I think that the methodology I use better tracks the amount of investment needed to produce a frontier model. As a sanity check
Can you say more about your methodological choices here? I don’t think I buy it.
My “sanity check” says you are estimating the total cost of training compute at ~7x the cost of the final training run’s compute, which seems wild?! 2x, sure; 3x, sure, maybe; but spending 6/7 ≈ 86% of your compute on testing before your final run does seem like a lot.
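To spell out the arithmetic behind that sanity check, here is a rough sketch using only the figures quoted in this thread (the ~$60M rental estimate and the ~7x multiplier), not any official numbers:

```python
# Back-of-the-envelope version of the sanity check above. Both inputs are
# taken from this thread, not from any official source.
final_run_rental_cost = 60e6      # approx. cost to rent hardware for the final run
total_compute_multiplier = 7      # estimated total training compute vs. final run

implied_total_cost = final_run_rental_cost * total_compute_multiplier
pre_final_run_share = (total_compute_multiplier - 1) / total_compute_multiplier

print(f"Implied total compute cost: ${implied_total_cost / 1e6:.0f}M")   # ~$420M
print(f"Compute spent before the final run: {pre_final_run_share:.0%}")  # ~86%
```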
It seems pretty reasonable to use the cost of buying the GPUs outright instead of renting them, due to the lifecycle thing you mention. But then you also have to price in the other value the company is getting out of the GPUs, notably that their inference costs are now made up only of operating costs. Or like, maybe OpenAI buys 25k A100s for a total of $375M. Maybe they plan on using them for 18 months total: they spend the first 6 months doing testing, 3 months on training, and then the last 9 months on inference. The cost of the whole training process should then only be considered $375M / 2 ≈ $188M (plus operating costs).
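A minimal sketch of that allocation, using the hypothetical numbers above (25k A100s at ~$15k each, an 18-month useful life); everything here is illustrative, not a claim about OpenAI’s actual fleet:

```python
# Attribute hardware cost to the training process in proportion to the
# fraction of the hardware's useful life it consumes (operating costs excluded).
# All numbers are the hypothetical ones from this comment.
num_gpus = 25_000
price_per_gpu = 15_000                      # USD, rough A100 purchase price
hardware_cost = num_gpus * price_per_gpu    # ~$375M

months_testing = 6
months_training = 3
months_inference = 9
total_months = months_testing + months_training + months_inference  # 18

training_share = (months_testing + months_training) / total_months  # 0.5
training_attributed_cost = hardware_cost * training_share

print(f"Hardware cost: ${hardware_cost / 1e6:.0f}M")                       # $375M
print(f"Cost attributed to testing + training: "
      f"${training_attributed_cost / 1e6:.0f}M")                           # ~$188M
```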
If you want to think of long-term rentals as mortgages, which again seems reasonable, you then have to factor in that the hardware isn’t being used only for one training cycle. It could be used for training multiple generations of models (e.g., this leak says many of the chips used for GPT-3 were also used for GPT-4 training), for running inference, or rented out to other companies when you don’t have a use for it.
I could be missing something, and I’m definitely not confident about any of this.
This is probably the decision I am least confident in; figuring out how to do the accounting here is challenging and depends a lot on what one wants to use the “cost” of a training run to reason about. Some questions I had in mind when thinking about cost:
If a lone actor wanted to train a frontier model, without loans or financial assistance from others, how much capital might they need?
How much money should I expect an AI lab to have spent by the time it trains a new frontier model, especially one that is a significant advancement over all prior models (like GPT-4 was)?
What is the largest frontier model it is feasible for any entity to create?
When a company trains a frontier model, how much are they “betting” on the future profitability of AI?
The simple initial way I compute cost, then, is to look at empirical evidence of company expenditures and investment.
Now, these numbers aren’t the same ones a company might care about: they represent expenses without accounting for likely revenue. The argument I find most tempting is that one should look at depreciation cost instead of capital expenditure, effectively subtracting the expected resale value of the hardware from the initial expenditure to purchase it (see the toy sketch after the two reasons below). I have two main reasons for not using this:
Computing depreciation cost is really hard, especially in this rapidly changing environment.
The resale value of an ML GPU is likely closely tied to the profitability of training a model: if it turns out that using frontier models for inference isn’t very profitable, then I’d expect the value of ML GPUs to decrease. Conversely, if inference is very profitable, then the resale value would increase. I think A100s, for example, have had their price substantially impacted by increased interest in AI; it’s not implausible to me that the resale value of an A100 is actually higher than the initial cost was for OpenAI.
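For concreteness, here is a toy sketch of the depreciation-cost framing described above. All numbers are made up; the point is just how sensitive the “effective cost” is to the assumed resale value:

```python
# Toy illustration of the depreciation-cost framing: treat the "real" cost of
# the hardware as purchase price minus expected resale value.
purchase_price = 375e6  # hypothetical up-front hardware spend

def depreciation_cost(purchase_price: float, expected_resale_value: float) -> float:
    """Cost 'consumed' by the training project under this framing."""
    return purchase_price - expected_resale_value

# The same purchase implies very different effective costs depending on how
# the resale market moves (which, per the point above, tracks how profitable
# inference turns out to be):
scenarios = [
    ("inference unprofitable", 150e6),
    ("inference profitable", 300e6),
    ("resale value above initial cost", 400e6),  # effective cost goes negative
]
for name, resale in scenarios:
    cost = depreciation_cost(purchase_price, resale)
    print(f"{name}: effective cost ${cost / 1e6:.0f}M")
```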
Having said all of this, I’m still not confident I made the right call here.
Also, I am relatively confident GPT-4 was trained only with A100s, and did not use any V100s as the colab notebook you linked speculates. I expect that GPT-3, GPT-4, and GPT-5 will all be trained with different generations of GPUs.
Speaking as someone who has had to manage multi-million dollar cloud budgets (though not in an AI / ML context), I agree that this is hard.
As you note, there are many ways to think about the cost of a given number of GPU-hours. No one approach is “correct”, as it depends heavily on circumstances. But we can narrow it down a bit: I would suggest that the cost is always substantially higher than the theoretical optimum one might get by taking the raw GPU cost and applying a depreciation factor.
As soon as you try to start optimizing costs – say, by reselling your GPUs after training is complete, or reusing training GPUs for inference – you run into enormous challenges. For example:
When is training “complete”? Maybe you discover a problem and need to re-run part of the training process.
You may expect to train another large model in N months, but if you sell your training GPUs, you can’t necessarily be confident (in the current market) of being able to buy new ones on demand.
If you plan to reuse GPUs for inference once training is done… well, it’s unlikely that the day after training is complete, your inference workload immediately soaks up all of those GPUs. Production (inference) workloads are almost always quite variable, and 100% hardware utilization is an unattainable goal.
The actual process of buying and selling hardware entails all sorts of overhead costs, from physically racking and un-racking the hardware, to finding a supplier / buyer, etc.
The closest you can come to the theoretical optimum is if you are willing to scale your workload to the available hardware, i.e. you buy a bunch of GPUs (or lease them at a three-year-commitment rate) and then scale your training runs to precisely utilize the GPUs you bought. In theory, you are then getting your GPU-hours at the naive “hardware cost divided by depreciation period” rate. However, you are now allowing your hardware capacity to dictate your R&D schedule, which is its own implicit cost – you may be paying an opportunity cost by training more slowly than you’d like, or you may be performing unnecessary training runs just to soak up the hardware.
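As a rough illustration of that gap, here is a minimal sketch of the effective GPU-hour rate under imperfect utilization. It reuses the hypothetical 25k-GPU / $375M numbers from earlier in the thread and assumes a three-year depreciation period, so none of these figures should be read as real:

```python
# Why real GPU-hour costs exceed the naive "hardware cost divided by
# depreciation period" rate: idle capacity spreads the same capital cost
# over fewer useful hours.
hardware_cost = 375e6
num_gpus = 25_000
depreciation_years = 3
total_gpu_hours = num_gpus * depreciation_years * 365 * 24

naive_rate = hardware_cost / total_gpu_hours  # $/GPU-hour at 100% utilization

for utilization in (1.0, 0.8, 0.5):
    effective_rate = naive_rate / utilization
    print(f"{utilization:.0%} utilization: ~${effective_rate:.2f}/GPU-hour "
          f"(naive rate ${naive_rate:.2f})")
```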