Some thoughts
When training multiple times on the same data, this method helps avoid overfitting/memorization: https://arxiv.org/abs/2406.10209
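A minimal sketch of how I understand that approach, excluding a pseudorandom subset of token positions from the training loss so nothing can be memorized verbatim even over many epochs. The function name, `drop_frac`, and the fixed seed are mine, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def loss_with_token_dropout(logits, targets, drop_frac=0.25, seed=0):
    """Next-token cross-entropy that ignores a pseudorandom fraction of positions.

    Dropped positions still appear as context in the forward pass; they just
    contribute no gradient, so repeated epochs can't push the model toward
    reproducing any training sequence token-for-token. (As I understand the
    paper, the mask is keyed on the text itself so it stays identical across
    epochs; a fixed seed stands in for that here.)
    """
    batch, seq_len, vocab = logits.shape
    gen = torch.Generator().manual_seed(seed)
    keep = (torch.rand(batch, seq_len, generator=gen) > drop_frac).to(logits.device)
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```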
Certain text sources are really information-rich and could benefit from unpacking. In particular, academic papers could be richly annotated with auto-generated details: relevant quotes from each of the cited papers, LLM-generated comprehension questions and answers, and verbal descriptions and analyses of the figures. I think that'd stretch the data a lot in high-quality areas.
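As a rough illustration of what that unpacking pass could look like (everything here, the function name, the `call_llm` callable, and the prompt wording, is a hypothetical stand-in rather than any specific API):

```python
def annotate_paper(paper_text, cited_papers, figure_captions, call_llm):
    """Attach auto-generated annotations to a paper before it enters the corpus.

    `cited_papers` maps citation keys to the full text of each cited work;
    `call_llm` is whatever text-generation callable you have available.
    """
    annotations = []
    # Relevant quotes from each cited paper.
    for cite_key, cited_text in cited_papers.items():
        annotations.append(call_llm(
            f"Quote the passages of the following cited work that are most "
            f"relevant to how [{cite_key}] is used in the paper above.\n\n"
            f"Paper:\n{paper_text}\n\nCited work:\n{cited_text}"))
    # Comprehension questions and answers.
    annotations.append(call_llm(
        f"Write comprehension questions with worked answers for this paper:\n\n"
        f"{paper_text}"))
    # Verbal descriptions and analysis of the figures.
    for caption in figure_captions:
        annotations.append(call_llm(
            f"Give a verbal description and analysis of the figure with this "
            f"caption, in the context of the paper:\n\n{caption}\n\nPaper:\n{paper_text}"))
    # Keep source and annotations together so the model trains on them jointly.
    return paper_text + "\n\n" + "\n\n".join(annotations)
```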
There’s been progress on novel algorithms/architectures which trade off training efficiency for saturation/forgetting resistance, e.g. KANformers or mixtures of millions of experts. That could make a small model stretch a lot further.
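For the "millions of experts" direction, here is a very simplified sketch of the product-key retrieval idea that makes such huge expert pools routable; it is not the actual PEER implementation (and says nothing about the KAN side), and all sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpertLayer(nn.Module):
    """Simplified product-key retrieval over a large pool of rank-1 experts.

    There are n_sub**2 experts, each a single hidden neuron (one input vector,
    one output vector). The query is split in half and scored against two small
    sub-key tables, so the pool can be enormous while routing stays cheap.
    """
    def __init__(self, d_model=256, n_sub=32, top_k=8):
        super().__init__()
        self.top_k = top_k
        n_experts = n_sub * n_sub
        self.query_proj = nn.Linear(d_model, d_model)
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub, d_model // 2) * 0.02)
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub, d_model // 2) * 0.02)
        self.expert_in = nn.Embedding(n_experts, d_model)
        self.expert_out = nn.Embedding(n_experts, d_model)

    def forward(self, x):                                   # x: (batch, d_model)
        q1, q2 = self.query_proj(x).chunk(2, dim=-1)
        s1 = q1 @ self.sub_keys1.T                          # (batch, n_sub)
        s2 = q2 @ self.sub_keys2.T                          # (batch, n_sub)
        # Score every expert as the sum of its two sub-key scores. Kept dense
        # here for clarity; the real trick prunes with a per-half top-k first.
        scores = (s1.unsqueeze(2) + s2.unsqueeze(1)).flatten(1)  # (batch, N)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gate = F.softmax(top_scores, dim=-1)                # (batch, k)
        e_in = self.expert_in(top_idx)                      # (batch, k, d_model)
        e_out = self.expert_out(top_idx)                    # (batch, k, d_model)
        h = F.gelu((e_in * x.unsqueeze(1)).sum(-1))         # (batch, k)
        return ((gate * h).unsqueeze(-1) * e_out).sum(1)    # (batch, d_model)
```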