I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured that it’s solvable by scale (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?). Anthropic’s work on Sleeper Agents seems like a very strong piece of evidence that it is the case.
Still, if they’re right that KANs don’t have this problem at much smaller sizes than MLP-based NNs, that’s very interesting. Nevertheless, I think describing catastrophic forgetting as a “serious problem in modern ML” is significantly misleading.
Pretraining, specifically: https://gwern.net/doc/reinforcement-learning/meta-learning/continual-learning/index#scialom-et-al-2022-section
The intuition is that after pretraining, models can map new data into very efficient low-dimensional latents and have tons of free space / unused parameters. So you can easily prune them, but also easily specialize them with LoRA (because the sparsity is automatic, just learned) or just regular online SGD.
But yeah, it’s not a real problem anymore, and the continual learning research community is still in denial about this and confining itself to artificially tiny networks to keep the game going.
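To make the LoRA point above a bit more concrete, here is a minimal NumPy sketch of a low-rank adapter on a single frozen weight matrix. It is only an illustration: the layer sizes, the rank of 4, and the `lora_forward` helper are assumptions chosen for the example, not taken from any of the linked work or from a particular library.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical helper)."""
    # Equivalent to x @ (W + scale * B @ A).T, without materialising the full update.
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 128, 4  # illustrative sizes only

W = rng.normal(size=(d_out, d_in))         # stand-in for a frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero-initialised

x = rng.normal(size=(8, d_in))
base_out = x @ W.T
adapted_out = lora_forward(x, W, A, B)

full_params = W.size            # parameters touched by full fine-tuning of this layer
lora_params = A.size + B.size   # parameters touched by the adapter

print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and any specialization has to fit into the small `A` and `B` matrices. That this is often enough in practice is consistent with the "lots of free space" intuition, though the sketch itself does not demonstrate it.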
I’m not so sure. You might be right, but I suspect that catastrophic forgetting may still be playing an important role in limiting the peak capabilities of an LLM of a given size. Would it be possible to continue Llama3 8B’s training much, much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint?
I think probably not? And I suspect that, if not, part (but not all) of the reason would be catastrophic forgetting. Another part would be the limited expressivity of smaller models, which is another thing the KANs seem to help with.