Man, these data requirements for large models really show just how horrendously data-inefficient current deep learning is: you need to hit a model with thousands of different variations of a sentence before it learns anything. I fear that we might be just one or two crucial insights away from cutting those data requirements by orders of magnitude.
If I understand this correctly, DeepMind uses each token in at most one update (they say they train for less than one epoch), which means it is hard to say anything about the data efficiency of deep learning from this paper, since the models are not trained to convergence on the data they have seen.
They probably do this because they already have the data, and a new data point is more informative than an old one, even if the model is slow to extract the available information in each update.
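For concreteness, here's a toy sketch of the distinction I mean (not DeepMind's actual setup, just linear regression with SGD): a single-pass regime where each example contributes to at most one update, versus reusing the same fixed dataset over many epochs until the loss converges.

```python
# Toy illustration (assumed setup, not from the paper): single-pass training
# vs. multi-epoch training to convergence on a 1-parameter linear model.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x + noise, true parameter w* = 3
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=10_000)

def sgd_step(w, xb, yb, lr=0.001):
    """One SGD update on a minibatch (mean squared error loss)."""
    pred = xb @ w
    grad = 2 * xb.T @ (pred - yb) / len(yb)
    return w - lr * grad

def train_single_pass(X, y, batch_size=32):
    """'Less than one epoch' style: every example is used in at most one update."""
    w = np.zeros(1)
    for i in range(0, len(X), batch_size):
        w = sgd_step(w, X[i:i + batch_size], y[i:i + batch_size])
    return w

def train_to_convergence(X, y, batch_size=32, epochs=50):
    """Multi-epoch style: reuse the same data many times."""
    w = np.zeros(1)
    for _ in range(epochs):
        perm = rng.permutation(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]
            w = sgd_step(w, X[idx], y[idx])
    return w

print("single pass:   ", train_single_pass(X, y))      # typically still far from 3
print("to convergence:", train_to_convergence(X, y))    # close to 3
```

The single-pass model has not extracted everything the data offers, which is why it is hard to read much about the data efficiency of the learning algorithm itself off the single-pass result.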