The bar for Nature papers is in many ways not so high. The latest one says that if you train indiscriminately on recursively generated data, your model will probably exhibit what they call model collapse. The authors purport to show that the amount of such content on the Web is enough to make this a real worry, rather than something that happens only if you deliberately set up some obviously stupid recursive training loop.
According to Rylan Schaeffer and coauthors, this doesn’t happen if you append the generated data to the rest of your training data and train on this (larger) dataset. That is:
Nature paper: generate N rows of data, remove N rows of "real" data from the dataset and replace them with the N generated rows, train, and repeat. This leads to collapse.
Schaeffer and co.: generate N rows of data, append the N generated rows to your training dataset (creating a dataset that is larger by N rows), train, and repeat. This does not lead to collapse.
As a simplified model of what will happen in future Web scrapes, the latter seems obviously more appropriate than the former.
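To make the two regimes concrete, here is a minimal toy sketch, not either paper's actual experiment: a one-dimensional Gaussian stands in for the model, estimating its mean and standard deviation stands in for training, and sampling from the fit stands in for generating data. All the names and constants (N, ROUNDS, fit, sample) are mine, chosen purely for illustration of the replace-versus-append distinction above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100        # rows generated per round
ROUNDS = 500   # generations of recursive training

def fit(data):
    """'Train' the toy model: estimate mean and std of a 1-D Gaussian."""
    return data.mean(), data.std()

def sample(params, n):
    """'Generate' n rows by sampling from the fitted model."""
    mu, sigma = params
    return rng.normal(mu, sigma, n)

real = rng.normal(0.0, 1.0, N)  # the original "real" data

# Replace-style loop (as in the Nature setup): the N generated rows
# overwrite the previous data, so after round one the model never
# sees the real data again.
replace = real.copy()
for _ in range(ROUNDS):
    replace = sample(fit(replace), N)

# Append-style loop (as in Schaeffer and co.'s setup): the N generated
# rows are added to the dataset, so the real data stays in the
# (growing) training set.
accumulate = real.copy()
for _ in range(ROUNDS):
    accumulate = np.concatenate([accumulate, sample(fit(accumulate), N)])

print("replace    -> mean %.3f, std %.3f" % fit(replace))     # std shrinks toward 0
print("accumulate -> mean %.3f, std %.3f" % fit(accumulate))  # stays close to (0, 1)
```

The only difference between the two loops is whether the generated rows overwrite the existing data or join it, which is exactly the distinction at issue between the two papers.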
I found this pretty convincing.
(See tweet thread and paper.)