Wow, this is super fascinating.

A juicy tidbit:

Catastrophic forgetting is a serious problem in current machine learning [24]. When a human masters a task and switches to another task, they do not forget how to perform the first task. Unfortunately, this is not the case for neural networks. When a neural network is trained on task 1 and then shifted to being trained on task 2, the network will soon forget about how to perform task 1. A key difference between artificial neural networks and human brains is that human brains have functionally distinct modules placed locally in space. When a new task is learned, structure re-organization only occurs in local regions responsible for relevant skills [25, 26], leaving other regions intact. Most artificial neural networks, including MLPs, do not have this notion of locality, which is probably the reason for catastrophic forgetting.
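The forgetting dynamic described in the quoted passage can be reproduced in a deliberately tiny toy setting: a single shared parameter trained with plain SGD on one regression task, then on another. Because the model has no spare capacity and no locality, training on task 2 necessarily overwrites task 1 (a minimal sketch, not a claim about how real networks forget):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(slope, n=256):
    # Toy regression task: y = slope * x + small noise
    x = rng.normal(size=(n, 1))
    y = slope * x + 0.01 * rng.normal(size=(n, 1))
    return x, y

def sgd(w, x, y, lr=0.05, steps=300):
    # Plain online SGD on per-sample squared error for a single weight
    for i in range(steps):
        xi, yi = x[i % len(x)], y[i % len(y)]
        grad = 2 * (w * xi - yi) * xi
        w = w - lr * grad
    return w

def loss(w, x, y):
    return float(np.mean((w * x - y) ** 2))

x1, y1 = make_task(2.0)    # task 1: slope +2
x2, y2 = make_task(-3.0)   # task 2: slope -3

w = 0.0
w = sgd(w, x1, y1)
loss_task1_before = loss(w, x1, y1)   # low: task 1 just learned

w = sgd(w, x2, y2)                    # now train only on task 2
loss_task1_after = loss(w, x1, y1)    # high: task 1 has been overwritten
```

After the second phase the weight sits near the task-2 optimum, so task-1 loss blows up; the paper's locality argument is exactly about avoiding this kind of global parameter overwrite.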
I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured it's solvable by scale (likely conditional on some aspects of the training setup, idk, a self-supervised predictive loss function?). Anthropic's work on Sleeper Agents seems like a very strong piece of evidence that this is the case.
Still, if they're right that KANs don't have this problem at much smaller sizes than MLP-based NNs, that's very interesting. Nevertheless, talking about catastrophic forgetting as a "serious problem in modern ML" seems significantly misleading.
Pretraining, specifically: https://gwern.net/doc/reinforcement-learning/meta-learning/continual-learning/index#scialom-et-al-2022-section
The intuition is that after pretraining, models can map new data into very efficient low-dimensional latents and have tons of free space / unused parameters. So you can easily prune them, but also easily specialize them with LoRA (because the sparsity is automatic, just learned) or just regular online SGD.
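The low-rank specialization mentioned above can be made concrete with a minimal sketch, assuming the usual LoRA parameterization W + BA with B initialized to zero (layer sizes and rank here are arbitrary illustrative choices, not anyone's actual config):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 512, 512, 4

# "Pretrained" weight: frozen during specialization
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)

# LoRA-style adapters: only A and B would be trained, W stays fixed.
# B starts at zero, so the adapted layer initially equals the base layer.
A = rng.normal(size=(rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def forward(x):
    # Adapted layer: W x + B (A x); the update B @ A has rank <= 4
    return x @ W.T + (x @ A.T) @ B.T

full_params = W.size             # 262144
lora_params = A.size + B.size    # 4096
print(lora_params / full_params) # 0.015625, i.e. ~1.6% of full fine-tuning
```

The point of the sketch is the parameter count: if the useful update really lives in a low-dimensional subspace, a rank-4 adapter touches well under 2% of the layer's weights while leaving the pretrained W intact.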
But yeah, it’s not a real problem anymore, and the continual learning research community is still in denial about this and confining itself to artificially tiny networks to keep the game going.
I'm not so sure. You might be right, but I suspect that catastrophic forgetting may still play an important role in limiting the peak capabilities of an LLM of a given size. Would it be possible to continue Llama3 8B's training much, much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint?
I think probably not? And if not, I suspect that part (but not all) of the reason would be catastrophic forgetting. Another part would be the limited expressivity of smaller models, another thing which KANs seem to help with.