If I understand this correctly, DeepMind is using each token in at most one update (they say they are training for less than one epoch). That makes it hard to say anything about the data efficiency of DL from this paper, since the models are not trained to convergence on the data they have seen.
They are probably doing this because they already have the data, and a new data point is more informative than an old one, even if the model is very slow to extract the available information with each update.
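To make the "not trained to convergence" point concrete, here is a toy sketch. It is not the paper's setup, and every name and number in it (`make_data`, `sgd`, the learning rate, the 1-D linear model) is an assumption I picked for illustration. It fits a single weight with plain SGD and compares one pass over a fixed dataset against many passes over the same data.

```python
import random

def make_data(n, true_w=3.0, seed=0):
    # Hypothetical noise-free 1-D regression data: y = true_w * x.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x) for x in xs]

def sgd(data, epochs, lr=0.01, w=0.0):
    # Plain SGD on squared error, one example per update.
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w

def mse(data, w):
    # Mean squared error of the fitted weight on the given data.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

data = make_data(200)

# Each example used in exactly one update (a single pass).
w_one_pass = sgd(data, epochs=1)
# Same data, revisited until the fit has essentially converged.
w_converged = sgd(data, epochs=200)

loss_one_pass = mse(data, w_one_pass)
loss_converged = mse(data, w_converged)

print("one pass:  w =", w_one_pass, "loss =", loss_one_pass)
print("converged: w =", w_converged, "loss =", loss_converged)
```

With a slow learner like this, the single pass stops well short of what the same 200 examples can support, so its final loss says little about how much information was in the data.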
I think it would be more accurate to say that the test was meant to check whether the TV shows were effective, rather than whether the children had a maximal inherent tendency towards virtuousness.