[comment wondering about impracticality of running a 1000x scaled up GPT. But as Gwern points out, running costs are actually pretty low. So even if we spent a billion or more on training a human-level AI, running costs would still be manageable.]
As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn’t cost dollars to run per hundred pages, so scaled up ones aren’t going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? “Information wants to be expensive, because the right information can change your life...”) GPT-3 cost millions of dollars to train, but pennies to run.
That’s the terrifying thing about NNs and what I dub the “neural net overhang”: the cost to create a powerful NN is millions of times greater than the cost to run that NN. (This is not true of many paradigms, particularly ones where there’s less of a distinction between training and running, but it is of NNs.) This is part of why there’s a hardware overhang—once you have the hardware to create an AGI NN, you then by definition already have the hardware to run orders of magnitude more copies or more cheaply or bootstrap it into a more powerful agent.
That’s the terrifying thing about NNs and what I dub the “neural net overhang”: the cost to create a powerful NN is millions of times greater than the cost to run that NN.
I’m not sure why that’s terrifying. It seems reassuring to me because it means that there’s no way for the NN to suddenly go FOOM because it can’t just quickly retrain.
But it can. That’s the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can ‘train’ GPT-3 without even any gradient steps—just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.
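As a minimal sketch of what 'training' by examples alone looks like: the demonstrations are simply packed into the prompt, and no weights change. The `build_few_shot_prompt` helper and the "Input:/Output:" layout are hypothetical choices made for illustration, not the format of any particular API.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Pack (input, output) demonstrations plus a new query into one prompt string."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final entry leaves the output blank for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Two demonstrations stand in for a 'training set'; no gradient step is taken.
examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(examples, "cat", "Translate English to French.")
print(prompt)
```

Switching tasks means swapping the example list, which is where the tiny marginal cost comes from.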
With NNs, ‘foom’ is not merely possible, it’s the default. If you train a model, then as soon as it’s done you get, among other things:
the ability to run thousands of copies in parallel on the same hardware
in a context like AlphaGo, I estimate several hundred Elo strength gains if you reuse the same hardware to merely run tree search with exact copies of the original model
meta-learning / transfer-learning to any related domain, cutting training requirements by orders of magnitude
model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware etc, but often extreme like 1/100th)
reuse of the model elsewhere to instantly power up other models (eg use of text or image embeddings for a DRL agent)
learning-by-doing/learning curve effects (highest in information technologies), so the next from-scratch model may be much cheaper (eg OA5 got a, what was it, 5x cost reduction for the second model OA trained from scratch based on the lessons of the first?)
baseline for engineering much more efficient ones by ablating and comparing with the original
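To make the distillation item concrete, here is a rough sketch of the soft-target loss commonly used to train a student on a teacher's outputs. The comment above names no specific method, so treating it as a temperature-softened KL divergence is an assumption; the logits and the temperature of 2.0 are made-up illustrative values.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]  # hypothetical teacher logits for one example
print(distillation_loss(teacher, teacher))          # student already matches the teacher
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))  # untrained, uniform student
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls a much smaller student toward the big model's behavior.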
model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware etc, but often extreme like 1/100th)
baseline for engineering much more efficient ones by ablating and comparing with the original
Somewhat related to these, if there’s such a huge gap between how expensive these models are to train and to run, then it seems like you’d end up wanting to run a whole bunch of them to help you train the next model, if you can.
You mention distilling a large model to a smaller, more efficient model. But can a smaller model also be used to efficiently bootstrap a new, larger model?
But can a smaller model also be used to efficiently bootstrap a new, larger model?
I’m not sure it’s done much, but probably, depending on what you’re thinking. You can probably do reverse-distillation (eg dark knowledge: use the logits of the smaller model to provide much richer feedback for the larger model while it’s untrained, saving compute, and eventually drop back to the raw data training signal once big > small to avoid its limits). More directly, you can use net2net model surgery to increase model sizes, like progressive growing in ProGAN, or more relevantly, the way OA kept doing model surgery on OA5 to warmstart it each time they wanted to handle some new Dota 2 feature or the latest version, saving an enormous amount of compute compared to starting from scratch dozens of times.
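A toy sketch of the net2net idea, assuming the function-preserving "widening" operation on a two-layer ReLU network without biases: new hidden units copy existing ones, and the outgoing weights are split among the copies so the bigger network starts out computing exactly the same function. The tiny random network and the names `widen` and `forward` are illustrative only; this is not how OA5's surgery was actually implemented.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(W1, W2, x):
    """Two-layer MLP without biases: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))


def widen(W1, W2, new_width, rng):
    """Grow the hidden layer to new_width units while preserving the function."""
    old_width = len(W1)
    # Each added unit is a copy of a randomly chosen existing unit.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    counts = [mapping.count(j) for j in range(old_width)]
    # Incoming weights are copied unchanged.
    new_W1 = [list(W1[j]) for j in mapping]
    # Outgoing weights are split evenly among the copies of each unit.
    new_W2 = [[row[j] / counts[j] for j in mapping] for row in W2]
    return new_W1, new_W2


rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 inputs -> 4 hidden
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 hidden -> 2 outputs
big_W1, big_W2 = widen(W1, W2, 7, rng)
x = [0.5, -1.0, 2.0]
print(forward(W1, W2, x))
print(forward(big_W1, big_W2, x))  # same outputs up to floating-point error
```

In practice one would presumably add a little noise to the copied weights so the duplicates can diverge during further training; the point here is only that the larger model inherits everything the smaller one learned instead of starting from scratch.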
So, given that big models are so powerful but so expensive to train, and that it is possible to bootstrap them a bit, do we converge towards a situation where we pay the cost of training the largest model approximately once, worldwide and across time? (In other words, that we’d just keep bootstrapping from whatever was best before, and no longer pay the cost of training from scratch.)
On the other hand, if compute (per dollar) keeps growing exponentially, then maybe it’s less significant whether you’re retraining from scratch or not. (Recapitulating the work equivalent to training yesterday’s models will be cheap, so no great benefit from bootstrapping.)
I’m not sure. I think one might have to do some formal economics modeling to see what dynamics might be: is this a natural monopoly situation where the first one to train a model wins and has a moat to deter anyone else from bothering, or do they invest revenue in continually expanding and improving the model in various ways to always keep ahead of competitors with network effects and so the decrease in cost of compute is largely irrelevant and it’s a natural oligopoly (in much the same way that creating a search engine is cheaper every day, in some sense, but good luck competing with Google), or what?
At least thus far, we haven’t seen monopolistic behavior naturally emerge: for all the efforts at AI cloud APIs, none of them have a lock on usage the way that, say, Nvidia GPUs have on hardware, and the constant progress (and regular giveaways of code/model/data by FANG) make it hard for anyone to attempt to enclose some commons; and as far as GPT-2 goes, quite a few entities trained their own >GPT-2-1.5b models after GPT-2 was announced (and I believe there are viable alternatives to other major DL projects like AlphaGo produced by open source groups or East Asian corporations), but on the gripping hand, that was back when it was so easy a hobbyist with a few crumbs from Google could do it (which happened twice) - as they get bigger, it won’t be so easy to download some dumps and put a few TFRC TPUs to work. So we’ll see how many competitors emerge to GPT-3 over the next year or two!
This was mentioned in the “Other Constraints” section of the original post:
Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh/100 pages of output, which works out to 500 pages/dollar from eyeballing hardware cost as 5x electricity. Scale up 1000x and you’re at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.
I’m skeptical of this being a binding constraint too. $2/page is still very cheap.
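The arithmetic in that quote can be checked with a few lines. Two things here are assumptions rather than figures from the quote: an electricity price of $0.10/kWh (chosen because it reproduces the quoted 500 pages/dollar), and reading "hardware cost as 5x electricity" as total cost being about 5 times the electricity bill. Linear scaling of cost with model size is the quote's own simplification.

```python
KWH_PER_100_PAGES = 0.4          # figure given in the quote
ELECTRICITY_USD_PER_KWH = 0.10   # assumption, not stated in the quote
HARDWARE_MULTIPLIER = 5          # total cost eyeballed as 5x the electricity cost


def cost_per_page(scale=1.0):
    """Dollars per page, assuming cost grows linearly with model scale."""
    electricity = KWH_PER_100_PAGES / 100 * ELECTRICITY_USD_PER_KWH
    return electricity * HARDWARE_MULTIPLIER * scale


print(f"GPT-3: {1 / cost_per_page():.0f} pages per dollar")
print(f"1000x: ${cost_per_page(1000):.2f} per page")
```

Under those assumptions the numbers come out at about 500 pages per dollar for GPT-3 and about $2 per page for a 1000x model, matching the quote.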
Interesting.
It means that if there are approaches that don’t need as much compute, the AI can invent them fast.