gwern on reddit:
“The most immediate implication of this would be that you now have a scaling law for transfer learning, and so you can predict how large a general-purpose model you need in order to obtain the necessary performance on a given low-n dataset. So if you have some economic use-cases in mind where you only have, say, _n_=1000 datapoints, you can use this to estimate what scale model is necessary to make that viable. My first impression is that this power law looks quite favorable*, and so this is a serious shot across the bow to any old-style AI startups or businesses which thought ‘our moat is our data, no one has the millions of datapoints necessary (because everyone knows DL is so sample-inefficient) to compete with us’. The curves here indicate that just training as large a model as possible on broad datasets is going to absolutely smash anyone trying to hand-curate finetuning datasets, especially for the small datasets people worry most about...
* eg the text->python example: the text is basically just random Internet text (same as GPT-3), Common Crawl etc, nothing special, not particularly programming-related, and the python is Github Python code; nevertheless, the transfer learning is very impressive: a 10x model size increase in the pretrained ‘text’ model is worth 100x more Github data!”
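The footnote's tradeoff can be sketched numerically. Assuming a power-law form for "effective data transferred" from pretraining, D_T = k · D_F^α · N^β (D_F = finetuning dataset size, N = model size), a world where 10x model size is worth 100x finetuning data corresponds to β = 2α. The constants below (α=0.18, β=0.36, k=1) are illustrative placeholders chosen to satisfy that relation, not fitted values from the paper:

```python
def effective_data_transferred(d_f, n, k=1.0, alpha=0.18, beta=0.36):
    """Hypothetical sketch: effective data transferred from pretraining,
    assuming the power-law form D_T = k * D_F**alpha * N**beta.
    Constants are illustrative placeholders, not fitted values."""
    return k * d_f**alpha * n**beta

# Multiplying model size N by 10 scales D_T by 10**beta;
# multiplying finetuning data D_F by 100 scales D_T by 100**alpha.
# With beta = 2*alpha these two interventions are exactly equivalent.
gain_from_model = effective_data_transferred(1e6, 1e10) / effective_data_transferred(1e6, 1e9)
gain_from_data = effective_data_transferred(1e8, 1e9) / effective_data_transferred(1e6, 1e9)
print(gain_from_model, gain_from_data)  # the two ratios coincide
```

Under this toy form you can also invert the law: fix the performance (i.e. the D_T) you need from your _n_=1000 datapoints, then solve for the N that delivers it, which is exactly the "what scale model makes this use-case viable" calculation described above.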