Data order matters more at smaller scales: if you're training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay.
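To make this concrete, here is a minimal sketch (the synthetic tasks, model, and hyperparameters are my own assumptions for illustration, not from the source): a small MLP is trained on one task, then exclusively on a second task whose label depends on different input features. With weight decay, the first-layer weights the second task doesn't use get pulled toward zero on top of ordinary gradient interference, so accuracy on the first task typically collapses.

```python
# Hypothetical demonstration of catastrophic forgetting under nonrandom
# data ordering (task A, then only task B) with weight decay.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(active_dims, n=2000, dim=20):
    """Binary task whose label depends only on `active_dims` of the input."""
    x = torch.randn(n, dim)
    y = (x[:, active_dims].mean(dim=1) > 0).long()
    return x, y

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def train(model, opt, x, y, steps=1000):
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
# Nontrivial weight decay: during phase 2, the weights reading the
# task-A features receive no useful gradient and decay toward zero.
opt = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.3)

xa, ya = make_task(active_dims=list(range(0, 10)))   # task A: dims 0-9
xb, yb = make_task(active_dims=list(range(10, 20)))  # task B: dims 10-19

train(model, opt, xa, ya)                 # phase 1: task A only
print("task A acc after phase 1:", accuracy(model, xa, ya))

train(model, opt, xb, yb, steps=2000)     # phase 2: task B only, nonrandom order
print("task A acc after phase 2:", accuracy(model, xa, ya))  # typically degrades
print("task B acc after phase 2:", accuracy(model, xb, yb))
```

A randomly shuffled mixture of both tasks in a single training phase would, by contrast, let the model retain both; the forgetting here comes from the ordering, and the decay term speeds it up.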