Probably the best search terms are “catastrophic interference” or “catastrophic forgetting”. Basically, the issue is that if you take a model that has been trained on one task and then fine-tune it on a different, unrelated task, its performance on the first task will tend to degrade.
From a certain perspective, it’s not particularly surprising that this happens. If you have a language model with 7B 32-bit parameters, that model can contain at most 28 GB of compressed information. If the model is “full”, any new information you push into it must necessarily “push” some other information out of it.
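For what it’s worth, the 28 GB figure is just the raw size of the weights, which upper-bounds how much information the parameters can store. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the 28 GB figure above.
n_params = 7e9            # 7B parameters
bytes_per_param = 32 / 8  # 32-bit parameters -> 4 bytes each
print(f"{n_params * bytes_per_param / 1e9:.0f} GB")  # -> 28 GB
```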
There are a number of ways to mitigate this issue, and in fact there’s a whole field of research devoted to doing so. Some examples:
Multitask Learning: Instead of training on a bunch of examples of task A and then a bunch of examples of task B, interleave the examples of A and B. The model trained on the interleaved data will perform better on both tasks than the pretrained base model, though it will typically not perform as well on A as a model fine-tuned only on A, nor as well on B as a model fine-tuned only on B (see the first sketch after this list).
Knowledge Distillation: Like multitask learning, except that instead of directly fine-tuning one model on both tasks A and B, you do separate fine-tunes on A and on B, and then use knowledge distillation to train a third model to imitate the outputs of the fine-tuned-on-A or the fine-tuned-on-B model, as appropriate for each training datapoint (see the second sketch after this list).
Mixture of Experts: Fine-tune one model on A and another on B, and then train a third model to predict which expert should be used to make a prediction for each input (or, more accurately, how the predictions of each expert should be weighted in determining the output). This can scale to an almost arbitrary number of tasks, but the compute cost scales linearly with the number of experts (or better than linearly if you’re clever about it, though the storage requirements still scale linearly with the number of experts). A minimal version is in the third sketch after this list.
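Here’s a minimal sketch of the multitask (interleaving) approach in PyTorch. The model, datasets, loss function, and hyperparameters are illustrative placeholders, not a specific recipe:

```python
# Multitask fine-tuning sketch: shuffle a combined dataset so each batch
# mixes examples of task A and task B, rather than training on all of A
# and then all of B.
import torch
from torch.utils.data import ConcatDataset, DataLoader

def finetune_multitask(model, task_a_dataset, task_b_dataset, loss_fn,
                       epochs=3, lr=1e-5, batch_size=8):
    combined = ConcatDataset([task_a_dataset, task_b_dataset])
    loader = DataLoader(combined, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model
```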
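And a sketch of the distillation loss, assuming each batch comes from a single task and `teachers` maps a task id to the corresponding fine-tuned teacher model (both assumptions of this sketch, not requirements of the technique):

```python
# Knowledge distillation sketch: the student is trained to match the softened
# output distribution of whichever fine-tuned teacher owns this batch's task.
import torch
import torch.nn.functional as F

def distill_step(student, teachers, inputs, task_id, temperature=2.0):
    with torch.no_grad():
        teacher_logits = teachers[task_id](inputs)
    student_logits = student(inputs)
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 as is conventional for distillation losses.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return loss
```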
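Finally, a toy two-expert mixture. `expert_a` and `expert_b` stand in for the models fine-tuned on A and B; the gate learns how to weight their predictions per input. Names and dimensions are placeholders:

```python
# Mixture-of-experts sketch: a learned gate produces per-input weights over
# the (frozen) experts, and the output is the weighted sum of their predictions.
# Note the forward-pass cost grows linearly with the number of experts.
import torch
import torch.nn as nn

class TwoExpertMixture(nn.Module):
    def __init__(self, expert_a, expert_b, input_dim):
        super().__init__()
        self.experts = nn.ModuleList([expert_a, expert_b])
        self.gate = nn.Linear(input_dim, len(self.experts))  # one score per expert
        # Typically the experts stay frozen and only the gate is trained.
        for p in self.experts.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                 # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, out_dim, n_experts)
        return (outputs * weights.unsqueeze(1)).sum(dim=-1)           # (batch, out_dim)
```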