Some thoughts
When training multiple times on the same data, this method helps avoid overfitting/memorization: https://arxiv.org/abs/2406.10209
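A minimal sketch of how I understand that approach, excluding a pseudorandom subset of token positions from the training loss so nothing can be memorized verbatim even over many epochs. The function name, `drop_frac`, and the fixed seed are mine, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def loss_with_token_dropout(logits, targets, drop_frac=0.25, seed=0):
    """Next-token cross-entropy that ignores a pseudorandom fraction of positions.

    Dropped positions still appear as context in the forward pass; they just
    contribute no gradient, so repeated epochs can't push the model toward
    reproducing any training sequence token-for-token. (As I understand the
    paper, the mask is keyed on the text itself so it stays identical across
    epochs; a fixed seed stands in for that here.)
    """
    batch, seq_len, vocab = logits.shape
    gen = torch.Generator().manual_seed(seed)
    keep = (torch.rand(batch, seq_len, generator=gen) > drop_frac).to(logits.device)
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```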
Certain text sources are really information-rich and could benefit from unpacking. In particular, academic papers could be richly annotated with auto-generated details: relevant quotes from each of the cited papers, LLM-generated comprehension questions and answers, and verbal descriptions and analyses of the figures. I think that'd stretch the data a lot in high-quality areas.
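As a rough illustration of what that unpacking pass could look like (everything here, the function name, the `call_llm` callable, and the prompt wording, is a hypothetical stand-in rather than any specific API):

```python
def annotate_paper(paper_text, cited_papers, figure_captions, call_llm):
    """Attach auto-generated annotations to a paper before it enters the corpus.

    `cited_papers` maps citation keys to the full text of each cited work;
    `call_llm` is whatever text-generation callable you have available.
    """
    annotations = []
    # Relevant quotes from each cited paper.
    for cite_key, cited_text in cited_papers.items():
        annotations.append(call_llm(
            f"Quote the passages of the following cited work that are most "
            f"relevant to how [{cite_key}] is used in the paper above.\n\n"
            f"Paper:\n{paper_text}\n\nCited work:\n{cited_text}"))
    # Comprehension questions and answers.
    annotations.append(call_llm(
        f"Write comprehension questions with worked answers for this paper:\n\n"
        f"{paper_text}"))
    # Verbal descriptions and analysis of the figures.
    for caption in figure_captions:
        annotations.append(call_llm(
            f"Give a verbal description and analysis of the figure with this "
            f"caption, in the context of the paper:\n\n{caption}\n\nPaper:\n{paper_text}"))
    # Keep source and annotations together so the model trains on them jointly.
    return paper_text + "\n\n" + "\n\n".join(annotations)
```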
There’s been progress on novel algorithms/architectures which trade off training efficiency for saturation/forgetting resistance, e.g. KANformers or mixtures of millions of experts. That could make a small model stretch a lot further.
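For the "millions of experts" direction, here is a very simplified sketch of the product-key retrieval idea that makes such huge expert pools routable; it is not the actual PEER implementation (and says nothing about the KAN side), and all sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpertLayer(nn.Module):
    """Simplified product-key retrieval over a large pool of rank-1 experts.

    There are n_sub**2 experts, each a single hidden neuron (one input vector,
    one output vector). The query is split in half and scored against two small
    sub-key tables, so the pool can be enormous while routing stays cheap.
    """
    def __init__(self, d_model=256, n_sub=32, top_k=8):
        super().__init__()
        self.top_k = top_k
        n_experts = n_sub * n_sub
        self.query_proj = nn.Linear(d_model, d_model)
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub, d_model // 2) * 0.02)
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub, d_model // 2) * 0.02)
        self.expert_in = nn.Embedding(n_experts, d_model)
        self.expert_out = nn.Embedding(n_experts, d_model)

    def forward(self, x):                                   # x: (batch, d_model)
        q1, q2 = self.query_proj(x).chunk(2, dim=-1)
        s1 = q1 @ self.sub_keys1.T                          # (batch, n_sub)
        s2 = q2 @ self.sub_keys2.T                          # (batch, n_sub)
        # Score every expert as the sum of its two sub-key scores. Kept dense
        # here for clarity; the real trick prunes with a per-half top-k first.
        scores = (s1.unsqueeze(2) + s2.unsqueeze(1)).flatten(1)  # (batch, N)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gate = F.softmax(top_scores, dim=-1)                # (batch, k)
        e_in = self.expert_in(top_idx)                      # (batch, k, d_model)
        e_out = self.expert_out(top_idx)                    # (batch, k, d_model)
        h = F.gelu((e_in * x.unsqueeze(1)).sum(-1))         # (batch, k)
        return ((gate * h).unsqueeze(-1) * e_out).sum(1)    # (batch, d_model)
```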