Speaking as someone who has had to manage multi-million-dollar cloud budgets (though not in an AI / ML context), I agree that this is hard.
As you note, there are many ways to think about the cost of a given number of GPU-hours. No one approach is “correct”, as it depends heavily on circumstances. But we can narrow it down a bit: I would suggest that the cost is always substantially higher than the theoretical optimum one might get by taking the raw GPU cost and applying a depreciation factor.
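To make that concrete (with purely hypothetical numbers – a $25,000 accelerator and a three-year depreciation window), the theoretical-optimum rate is just the hardware price spread over the hours in the depreciation window:

```python
# Minimal sketch of the "theoretical optimum" rate: raw hardware cost spread
# evenly over a depreciation window. The price and window are hypothetical.

HOURS_PER_YEAR = 24 * 365

def naive_gpu_hour_cost(hardware_cost_usd: float, depreciation_years: float) -> float:
    """Hardware cost divided by the total hours in the depreciation period."""
    return hardware_cost_usd / (depreciation_years * HOURS_PER_YEAR)

print(f"${naive_gpu_hour_cost(25_000, 3):.2f} per GPU-hour")  # -> $0.95
```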
As soon as you start optimizing costs – say, by reselling your GPUs after training is complete, or by reusing training GPUs for inference – you run into enormous challenges. For example:
- When is training “complete”? Maybe you discover a problem and need to re-run part of the training process.
- You may expect to train another large model in N months, but if you sell your training GPUs, you can’t necessarily be confident (in the current market) of being able to buy new ones on demand.
- If you plan to reuse GPUs for inference once training is done… well, it’s unlikely that the day after training is complete, your inference workload immediately soaks up all of those GPUs. Production (inference) workloads are almost always quite variable, and 100% hardware utilization is an unattainable goal.
- The actual process of buying and selling hardware entails all sorts of overhead costs, from physically racking and un-racking the hardware, to finding a supplier / buyer, etc.
The closest you can come to the theoretical optimum is to scale your workload to the available hardware, i.e. you buy a bunch of GPUs (or lease them at a three-year-commitment rate) and then scale your training runs to precisely utilize the GPUs you bought. In theory, you are then getting your GPU-hours at the naive “hardware cost divided by depreciation period” rate. However, you are now allowing your hardware capacity to dictate your R&D schedule, which carries its own implicit cost – you may be paying an opportunity cost by training more slowly than you’d like, or you may be performing unnecessary training runs just to soak up the hardware.
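To put a rough number on that gap, here is a quick sketch (again with hypothetical price, depreciation window, and utilization figures) of how the effective cost per useful GPU-hour pulls away from the naive depreciation rate as soon as the hardware sits partly idle:

```python
# Sketch of how idle capacity inflates the effective rate per *useful* GPU-hour.
# The hardware price, depreciation window, and utilization figures are hypothetical.

HOURS_PER_YEAR = 24 * 365
naive_rate = 25_000 / (3 * HOURS_PER_YEAR)  # ~$0.95/hr: $25k GPU depreciated over 3 years

for utilization in (1.0, 0.8, 0.6, 0.4):
    effective = naive_rate / utilization  # you pay for every depreciated hour, useful or not
    print(f"{utilization:>4.0%} utilization -> ${effective:.2f} per useful GPU-hour")
```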