I understand that modern LLMs are generally trained only for a single epoch, or at most a few.
Is this true?
Why is this? Is it due to the cost of compute? Or is there so much data available that you can always just expand the dataset rather than using the same observations twice? Or for some other reason?
A big chunk of it is something like:
All else equal, in terms of capability and generalization per training iteration, you get the most bang for your buck from data that doesn't just repeat itself over and over; fresh tokens beat replayed ones (there's a minimal sketch of the single-epoch setup after this list).
Big bleeding-edge/experimental models are mostly concerned with training cost, not so much inference, so they'll grab any low-hanging fruit for improving training efficiency within the target budget.
If you have enough data sitting around, you might as well use it.
For consumer-facing products, a bit of “suboptimal” training to save time during inference can make sense. Throwing more epochs at that use case might win out sometimes, since the loss does tend to keep going down (at least a bit). We might also see more epochs in models that are up against the soft barrier of running out of easy tokens, but there are a lot of ways around that too (there's some rough token-budget arithmetic below).
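To make the "epoch" part concrete, here's a minimal sketch in PyTorch. The toy model and random token data are stand-ins, not any lab's actual setup; the only point is the loop shape. Frontier-scale pretraining essentially runs the outer loop once, and raising EPOCHS just replays the same examples instead of showing the model anything new.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a language model and a pre-tokenized corpus.
VOCAB, SEQ_LEN, EPOCHS = 100, 16, 1   # typical pretraining: EPOCHS = 1

model = nn.Sequential(
    nn.Embedding(VOCAB, 32),          # token embeddings
    nn.Flatten(),                     # (batch, seq, dim) -> (batch, seq*dim)
    nn.Linear(32 * SEQ_LEN, VOCAB),   # predict a next-token id
)
data = TensorDataset(
    torch.randint(0, VOCAB, (512, SEQ_LEN)),  # fake "context" tokens
    torch.randint(0, VOCAB, (512,)),          # fake next-token targets
)
loader = DataLoader(data, batch_size=32, shuffle=True)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One full pass over the loader = one epoch. EPOCHS > 1 revisits the
# same examples rather than adding new information.
for epoch in range(EPOCHS):
    for tokens, next_token in loader:
        loss = loss_fn(model(tokens), next_token)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In practice the "dataset" is a stream of shards far too big to load at once, but the one-pass structure is the same.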
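And for the "running out of easy tokens" point, some back-of-envelope arithmetic. Both constants are assumptions: the ~20 tokens-per-parameter ratio is the commonly cited Chinchilla-style rule of thumb, and the 10T-token corpus size is just a stand-in for a big deduplicated web-scale dataset, not a figure from any specific model.

```python
# Back-of-envelope only: both constants below are rules of thumb / assumptions.
CORPUS_TOKENS = 10e12      # assumed size of a deduplicated training corpus
TOKENS_PER_PARAM = 20      # commonly cited compute-optimal ratio (Chinchilla)

for params in (7e9, 70e9, 400e9, 1e12):
    wanted = params * TOKENS_PER_PARAM        # tokens the model "wants"
    epochs_needed = wanted / CORPUS_TOKENS    # passes over the corpus
    print(f"{params / 1e9:6.0f}B params -> ~{wanted / 1e12:4.1f}T tokens "
          f"(~{epochs_needed:.2f} epochs of the corpus)")
```

Under those assumptions, only the very largest models ask for more tokens than the corpus has, which is where repeating data (or finding other workarounds) starts to look attractive.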