I think this only holds if fine-tunes are composable, which as far as I can tell they aren’t. Fine-tuning on one task subtly degrades performance on a bunch of other tasks. That isn’t a big deal if you fine-tune a little for performance on a few tasks, but it does mean you probably can’t take a million independently fine-tuned models and merge them into a single super-model of the same size with the same performance on all million tasks.
I don’t think I’ve ever heard of any evidence for this being the case.
Probably the best search terms are “catastrophic interference” or “catastrophic forgetting”. Basically, the issue is that if you take some model that is tuned on some task, and then fine-tune it on a different, unrelated task, performance on the first task will tend to degrade.
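To make that concrete, here is a toy sketch (not a language model, and every number in it is made up for illustration): a two-weight linear model is trained on a hypothetical task A, then trained further on a hypothetical task B only. The function names `predict`, `mse` and `sgd` are just mine for this example.

```python
# Toy illustration of forgetting under sequential training.
# The tasks, learning rate and epoch count are arbitrary choices.

def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def sgd(w, data, lr=0.1, epochs=200):
    """Plain per-example gradient descent on squared error."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
    return w

# Task A only constrains w0 + w1 (it wants w0 + w1 = 2).
task_a = [((s, s), 2.0 * s) for s in (0.5, 1.0, 1.5)]
# Task B only constrains w0 (it wants w0 = 3).
task_b = [((s, 0.0), 3.0 * s) for s in (0.5, 1.0, 1.5)]

w_a = sgd([0.0, 0.0], task_a)   # train on A
w_ab = sgd(w_a, task_b)         # then train on B only

loss_a_before = mse(w_a, task_a)   # near zero: A has been learned
loss_a_after = mse(w_ab, task_a)   # large: training on B moved w0
loss_b_after = mse(w_ab, task_b)   # near zero: B has been learned
print(loss_a_before, loss_a_after, loss_b_after)
```

Training on B never looks at task A’s data, so nothing stops it from moving a weight that A depended on. In this toy case a setting that fits both tasks does exist (w0 = 3, w1 = -1); sequential training just has no reason to find it.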
From a certain perspective, it’s not particularly surprising that this happens. If you have a language model with 7B 32-bit parameters, that language model can contain at most 28 GB of compressed information (7 billion parameters × 4 bytes each). If the model is “full”, any new information you push into it must necessarily “push” some other information out of it.
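The arithmetic behind that upper bound, as a quick sketch (the helper name `raw_capacity_gb` is just for this example, and GB here means 10^9 bytes):

```python
def raw_capacity_gb(n_params, bits_per_param):
    """Upper bound on stored information: the raw size of the weights."""
    total_bytes = n_params * bits_per_param // 8
    return total_bytes / 1e9

capacity_fp32 = raw_capacity_gb(7_000_000_000, 32)
capacity_fp16 = raw_capacity_gb(7_000_000_000, 16)
print(capacity_fp32)  # 28.0
print(capacity_fp16)  # 14.0
```

This is only a ceiling on what the weights could hold, not an estimate of how much information a trained model actually stores.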
There are a number of ways to mitigate this issue, and in fact there’s a whole field of research devoted to it. Examples:
Multitask Learning: Instead of training on a bunch of examples of task A, and then a bunch of examples of task B, interleave the examples of A and B. The model trained on both will perform better on A and on B than the pretrained base model does, though it will not perform as well on A as (the base model trained only on A) or as well on B as (the base model trained only on B).
Knowledge Distillation: Like multitask learning, except that instead of directly fine-tuning a model on both tasks A and B, you do separate fine-tunes on A and on B, and then use knowledge distillation to train a third model to imitate the outputs of the fine-tuned-on-A or fine-tuned-on-B model, as appropriate for each training datapoint.
Mixture of Experts: Fine-tune one model on A, and another on B, and then train a third model to predict which model should be used to make a prediction for each input (or more accurately, how the predictions of each expert model should be weighted in determining the output). This can scale to an almost arbitrary number of tasks, but the cost scales linearly with the number of experts (or better-than-linearly if you’re clever about it, though the storage requirements still scale linearly with the number of experts).
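As a rough sketch of the “weighted predictions” part of the mixture-of-experts idea: the two experts below are stand-in functions rather than real fine-tuned models, and the gate is hand-written where a real one would be learned. All names (`expert_a`, `expert_b`, `gate`, `mixture`) are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-ins for two separately fine-tuned models.
def expert_a(x):
    return 2.0 * x

def expert_b(x):
    return -1.0 * x + 5.0

experts = [expert_a, expert_b]

def gate(x):
    # Hand-written gate for illustration: prefers expert_a for negative
    # inputs and expert_b for positive ones. In practice this is trained.
    return softmax([-3.0 * x, 3.0 * x])

def mixture(x):
    """Output is the gate-weighted sum of every expert's prediction."""
    weights = gate(x)
    return sum(w * f(x) for w, f in zip(weights, experts))

out_neg = mixture(-5.0)   # dominated by expert_a
out_zero = mixture(0.0)   # equal weights on both experts
out_pos = mixture(5.0)    # dominated by expert_b
print(out_neg, out_zero, out_pos)
```

Note that `mixture` as written calls every expert on every input, which is the linear cost. Doing better than linear means only evaluating the experts the gate weights highly, but you still have to keep every entry of `experts` around, so storage stays linear.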