model compression/distillation to train student models which are a fraction of the size, FLOPs, or latency (ratios vary widely with task, approach, domain, acceptable performance degradation, target hardware, etc., but are often extreme, like 1/100th)
baseline for engineering much more efficient ones by ablating and comparing with the original
Somewhat related to these, if there’s such a huge gap between how expensive these models are to train and to run, then it seems like you’d end up wanting to run a whole bunch of them to help you train the next model, if you can.
You mention distilling a large model to a smaller, more efficient model. But can a smaller model also be used to efficiently bootstrap a new, larger model?
I’m not sure it’s done much, but probably, depending on what you have in mind. You can probably do reverse distillation (e.g., dark knowledge: use the logits of the smaller model to provide much richer feedback for the larger model while it is still untrained, saving compute, and eventually drop back to the raw-data training signal once big > small, so the larger model is not capped by the smaller one’s limits). More directly, you can use Net2Net-style model surgery to increase model size, like progressive growing in ProGAN, or more relevantly, the way OA kept doing model surgery on OA5 to warm-start it each time they wanted to handle some new Dota 2 feature or the latest version, saving an enormous amount of compute compared to starting from scratch dozens of times.
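To make those two ideas concrete, here is a minimal NumPy sketch, not anyone's actual training code. `blended_loss` and `soft_weight_schedule` illustrate reverse distillation as described above: the big model trains against the small model's tempered output distribution, and that weight is annealed to zero so training falls back to the raw labels. `net2wider` illustrates Net2Net-style widening of one hidden layer in a toy two-layer ReLU network. All function names, the linear schedule, and the temperature value are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def blended_loss(big_logits, small_logits, labels, soft_weight, temperature=2.0):
    """Mix of a soft-target loss (small model as teacher) and ordinary cross-entropy.

    soft_weight=1.0 trains only on the small model's output distribution;
    soft_weight=0.0 is plain training on the raw labels.
    """
    n = big_logits.shape[0]
    # Soft term: cross-entropy between the two tempered distributions.
    soft_targets = np.exp(log_softmax(small_logits, temperature))
    soft_term = -(soft_targets * log_softmax(big_logits, temperature)).sum(axis=-1).mean()
    # Hard term: cross-entropy against the integer class labels.
    hard_term = -log_softmax(big_logits)[np.arange(n), labels].mean()
    return soft_weight * soft_term + (1.0 - soft_weight) * hard_term

def soft_weight_schedule(step, anneal_steps):
    """Assumed schedule: linearly decay the soft-target weight from 1 to 0."""
    return max(0.0, 1.0 - step / anneal_steps)

def forward(x, W1, b1, W2):
    """Toy network: y = relu(x @ W1 + b1) @ W2."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2

def net2wider(W1, b1, W2, new_width, rng):
    """Widen the hidden layer of the toy network without changing its output."""
    old_width = W1.shape[1]
    assert new_width >= old_width
    # Keep every original unit, then choose random units to duplicate.
    extra = rng.integers(0, old_width, size=new_width - old_width)
    mapping = np.concatenate([np.arange(old_width), extra])
    # Number of copies of each original unit in the widened layer.
    counts = np.bincount(mapping, minlength=old_width)
    W1_new = W1[:, mapping]
    b1_new = b1[mapping]
    # Split each unit's outgoing weights evenly among its copies.
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, b1_new, W2_new
```

The widened network computes the same function as the original, so training resumes from the smaller model's performance instead of from scratch. In practice one would probably add a little noise to the duplicated units so the copies do not stay identical during further training.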
Interesting.
So, given that big models are so powerful but so expensive to train, and that it is possible to bootstrap them a bit, do we converge towards a situation where we pay the cost of training the largest model approximately once, worldwide and across time? (In other words, we’d just keep bootstrapping from whatever was best before, and no longer pay the cost of training from scratch.)
On the other hand, if compute (per dollar) keeps growing exponentially, then maybe it’s less significant whether you’re retraining from scratch or not. (Recapitulating the work equivalent to training yesterday’s models will be cheap, so no great benefit from bootstrapping.)
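As a rough illustration of that point, the arithmetic can be sketched as below. The function name and the doubling time are purely hypothetical; the thread does not give a figure for how fast compute per dollar grows.

```python
def relative_cost(years_elapsed, doubling_time_years):
    """Fraction of the original dollar cost needed to redo a fixed amount of
    compute, assuming compute per dollar doubles every doubling_time_years."""
    return 0.5 ** (years_elapsed / doubling_time_years)

# Hypothetical example: with an assumed 2-year doubling time, redoing a
# training run 4 years later costs a quarter of what it originally did.
print(relative_cost(4, 2.0))
```

Under that assumption, the saving from bootstrapping off an old model shrinks over time at the same exponential rate as the cost of simply retraining it.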
I’m not sure. I think one might have to do some formal economics modeling to see what the dynamics might be. Is this a natural monopoly situation, where the first one to train a model wins and has a moat to deter anyone else from bothering? Or do they invest revenue in continually expanding and improving the model in various ways to always keep ahead of competitors with network effects, so that the decrease in the cost of compute is largely irrelevant and it’s a natural oligopoly (in much the same way that creating a search engine is cheaper every day, in some sense, but good luck competing with Google)? Or what?
At least thus far, we haven’t seen monopolistic behavior naturally emerge: for all the efforts at AI cloud APIs, none of them have a lock on usage the way that, say, Nvidia GPUs have on hardware, and the constant progress (and regular giveaways of code/model/data by FANG) makes it hard for anyone to attempt to enclose some commons. As far as GPT-2 goes, quite a few entities trained their own >GPT-2-1.5b models after GPT-2 was announced (and I believe there are viable alternatives to other major DL projects like AlphaGo produced by open source groups or East Asian corporations). But on the gripping hand, that was back when it was so easy that a hobbyist with a few crumbs from Google could do it (which happened twice); as the models get bigger, it won’t be so easy to download some dumps and put a few TFRC TPUs to work. So we’ll see how many competitors to GPT-3 emerge over the next year or two!