People do this a lot with BERT, and it has its own problems; the first section of this recent paper gives a good overview.
Then of course there is plenty of work trying to mitigate those problems, like that paper . . . but there are still various ways of doing so, with no clear consensus. So a more general statement of few-shot’s promise might be “you don’t have to worry about which fine-tuning setup you’re going to use, out of the many available alternatives, all of which have pitfalls.”
I think the results in that paper argue that it’s not really a big deal as long as you don’t make some basic errors like trying to fine-tune on tasks sequentially. MT-A outperforms Full in Table 1. GPT-3 is already a multi-task learner (as is BERT), so it would be very surprising if training on fewer tasks were too difficult for it.