It seems very unlikely that they’re running their models at 32-bit precision. 8-bit seems more likely, or at most 16-bit. And yes, obviously they're batching and pipelining, and probably using things comparable to all the attention-cost improvements that have been happening on the open-source side (if they didn’t invent them in parallel, they’ll certainly adopt them). Plus they mostly run Turbo models now: recent rumors about projects named Arrakis and Gobi, plus the launch of GPT-4 Turbo, suggest that making inference more efficient is very important to them.
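To put rough numbers on the precision point, here is a minimal sketch of how the memory needed just to hold the weights scales with bits per weight. The parameter count is a made-up round number for illustration, not a claim about any real model, and this ignores activations, KV cache, and everything else.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only to make the arithmetic easy to read.
HYPOTHETICAL_PARAMS = 100e9

for bits in (32, 16, 8):
    gb = weight_memory_gb(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit: {gb:.0f} GB")
```

Under those assumptions, going from 32-bit to 8-bit cuts the weight footprint by 4x, which translates fairly directly into fewer GPUs per model replica.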
Despite all that, I still wouldn’t be surprised if they were charging below cost, but I suspect they’re charging a price around where they think they can soon(ish) reduce inference costs to, between algorithmic improvements and Moore’s Law for GPUs.
Basically, they’re a start-up: they don’t need to be profitable yet; they need to persuade their investors that they have a credible plan for reaching profitability in the next few years.