I think this only holds if fine-tunes are composable, which, as far as I can tell, they aren’t.
You know ‘finetunes are composable’, because a finetune is just a gradient descent step on a batch of data and a parameter update, and if you train on more than one GPU and share updates, DL training still works {{citation needed}}.
If you can train asynchronously on a thousand, or 20,000, or 100,000 GPUs, that is what you are doing; this is especially true in DRL, where you might be, say, training across 170,000 CPU-cores. This works because you don’t insist on everything being up to date every moment and you accept that there will be degrees of inconsistency/outdatedness. (You are certainly not accumulating the gradient across the entire cluster by waiting for every single node, pausing everything, calculating a single global step, and pushing it out, and only then resuming, as if it were a single GPU! Really, you don’t even want to do that on a single GPU for DRL if you gotta go fast.) This works so well that people will casually talk about training “an” AlphaZero, even though they actually mean something more like “the 512 separate instances of AlphaZero we are composing finetunes of” (or more).*
You do have issues with stale gradients and off-policyness of updates and how to best optimize throughput of all of the actors vs training nodes and push out model updates efficiently so nodes stop executing outdated parameters as quickly as possible, and DeepMind & OpenAI etc have done a lot of work on that—but at that point, as in the joke, you have conceded that finetunes are composable and you can keep a very large number of replicas in sync, and it is merely a matter of haggling over how much efficiency you lose.
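As a toy demonstration of the point (the linear “model”, worker count, and learning rate here are all invented for illustration), asynchronous SGD with stale gradients still converges: each worker computes its update against an outdated copy of the weights, applies it to the shared parameters anyway, and the composed updates work fine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: linear regression with true weights w*. Eight workers hold
# possibly-stale copies of the shared parameters and fire asynchronously.
w_true = np.array([2.0, -1.0])
w = np.zeros(2)                            # the shared "model"
replicas = [w.copy() for _ in range(8)]    # stale local copies

lr = 0.05
for step in range(1000):
    k = int(rng.integers(8))               # one worker finishes a batch
    x = rng.normal(size=(16, 2))           # its private mini-batch
    y = x @ w_true
    stale = replicas[k]                    # gradient from outdated weights
    grad = x.T @ (x @ stale - y) / len(x)
    w = w - lr * grad                      # its "finetune" composes anyway
    replicas[k] = w.copy()                 # worker pulls the latest delta

print(np.linalg.norm(w - w_true) < 0.05)  # → True
```

No global barrier, no single accumulated gradient step, and yet the shared weights land on the right answer; the staleness only costs some efficiency.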
Also note that it takes a lot less compute to keep a model up to date doing simple online learning on new data than it does to train it from scratch on all historical data summed together (obviously), so what devrandom is talking about is actually a lot easier than creating the model in the first place.
A better model to imagine is not “somehow finetunes from millions of independent models magically compose” (although actually they would compose pretty well), but more like, “millions of independent actors do their ordinary business, while spending their spare bandwidth downloading the latest binary delta from peer nodes (which due to sparsity & not falling too far out of sync, is always on the order of megabytes, not terabytes), and once every tens of thousands of forward passes, discover a novel or hard piece of data, and mail back a few kilobytes of text to the central training node of a few thousand GPUs, who are continually learning on the hard samples being passed back to them by the main fleet, and who keep pushing out an immediately updated model to all of the actor models, and so ‘the model’ is always up to date and no instance is more than hours out of date with ‘the model’ (aside from the usual long tail of stragglers or unhealthy nodes which will get reaped)”.
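A minimal sketch of the actor-side half of that loop (the function name and stand-in loss are invented for illustration): each actor scores what it sees with its local model copy and mails back only the surprising samples, which is why the upstream traffic is kilobytes rather than the full data stream.

```python
def select_hard_samples(samples, loss_fn, threshold):
    """Keep only the samples the current model finds surprising (high loss)."""
    return [s for s in samples if loss_fn(s) > threshold]

# Stand-in loss: pretend higher values are harder. Out of 100 forward
# passes, only the handful of outliers get mailed back upstream.
hard = select_hard_samples(range(100), loss_fn=lambda s: s / 100, threshold=0.95)
print(hard)  # → [96, 97, 98, 99]
```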
* I fear this is one of those cases where our casual reification of entities leads to poor intuitions, akin to asking ‘how many computers are in the computer you are using right now?’; usually, the answer is just ‘1’, because really, who cares how exactly your ‘smartphone’ or ‘laptop’ or ‘desktop’ or ‘server’ is made up of a bunch of different pieces of silicon—unless you’re discussing something like device performance or security, in which case it may matter quite a lot and you’d better not think of yourself as owning ‘a’ smartphone.
I think we may be using words differently. By “task” I mean something more like “predict the next token in a nucleotide sequence” and less like “predict the next token in this one batch of training data that is drawn from the same distribution as all the other batches of training data that the parallel instances are currently training on”.
It’s not an argument that you can’t train a little bit on a whole bunch of different data sources; it’s an argument that running 1.2M identical instances of the same model leaves a lot of predictive power on the table compared to letting those models specialize. For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained on both the 20TB GenBank dataset and all 14TB of code on GitHub.
Once you have a bunch of specialized models, “the weights are identical” and “a fine-tune can be applied to all members” no longer hold.
For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained on both the 20TB GenBank dataset and all 14TB of code on GitHub.
I don’t believe that’s obvious, and to the extent that it’s true, I think it’s largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).
Once you have a bunch of specialized models, “the weights are identical” and “a fine-tune can be applied to all members” no longer hold.
Nor do I see how this is relevant to your original claim. If you have lots of task-specialist models, how does this refute the claim that those will be able to coordinate? Of course they will. They will just share weight updates in exactly the way I just outlined, which works so well in practice. You may not be able to share parameter-updates across your protein-only and your Python-only LLMs, but they will be able to share updates within that model family and the original claim (“AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.”) remains true, no matter how you swap out your definition of ‘model’.
DL models are fantastically good at collaborating and updating each other, in many ways completely impossible for humans, whether you are talking about AGI models or narrow specialist models.
I don’t believe that’s obvious, and to the extent that it’s true, I think it’s largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).
Man, that Li et al. paper has pretty wild implications if it generalizes. I’m not sure how to square those results with the Chinchilla paper, though (I’m assuming it wasn’t something dumb like “wall-clock time was better with larger models because training was constrained by memory bandwidth, not compute”).
In any case, my point was more “I expect dumb throw-even-more-compute-at-it approaches like MoE, which can improve their performance quite a bit at the cost of requiring ever more storage space and ever-increasing inference costs, to outperform clever attempts to squeeze more performance out of single giant models”. If models just keep getting bigger while staying monolithic, I’d count that as pretty definitive evidence that my expectations were wrong.
Edit: For clarity, I specifically expect that MoE-flavored approaches will do better because, to a first approximation, sequence modelers will learn heuristics in order of most to least predictive of the next token. That depends on the strength of the pattern and the frequency with which it comes up.
As a concrete example, the word “literally” occurs with a frequency of approximately 1⁄100,000. In about 1 in 6,000 of its occurrences, the word “literally” is followed by the word “crying”, while in about 1 in 40,000 of its occurrences it is followed by “sobbing”. If you just multiply it out, then upon seeing the word “literally”, you should expect “crying” to be about 7x more likely to follow than “sobbing”. One of the things a language model could learn, though, is that if your text is similar to text from the early 1900s, that ratio should be more like 4:1, whereas if it’s more like text from the mid-1900s it should be more like 50:1. Learning the conditional effect of the year of authorship on the relative frequencies of those 2-grams will improve overall model loss by about 3e-10 bits per word, if I’m calculating correctly (source: Google Ngrams).
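Checking that arithmetic directly, with the rates above:

```python
p_literally = 1 / 100_000   # frequency of "literally"
p_crying    = 1 / 6_000     # P(next word = "crying" | "literally")
p_sobbing   = 1 / 40_000    # P(next word = "sobbing" | "literally")

# Unconditional ratio of "crying" vs "sobbing" after "literally".
print(round(p_crying / p_sobbing, 1))   # → 6.7, i.e. roughly 7x

# The whole "literally (crying|sobbing)" event is rare to begin with,
# which is why the achievable loss improvement is so tiny.
print(f"{p_literally * (p_crying + p_sobbing):.1e}")  # → 1.9e-09
```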
If there’s some important fact about one specific unexpected nucleotide which occurs in half of mammalian genomes, but nucleotide sequence data is only 1% of your overall data and the other data you’re feeding the model includes text, your model will prefer to learn a gajillion little linguistic facts on the level of the above over learning this cool tidbit of information about genomes. Whereas if you separate out the models learning linguistic tidbits from the ones predicting nucleotide sequences, learning little linguistic tricks will trade off against learning other little linguistic tricks, and learning little genetics facts will trade off against learning other little genetics facts.
And if someone accidentally dumps a few database exports full of password hashes into the training dataset, then only one of your experts will decide that memorizing a few hundred million MD5 digests is the most valuable thing it could be doing, while the rest of your experts continue chipping happily away at discovering marginal patterns in their own little domains.
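The mechanism being leaned on here is just top-1 gating. A toy router (illustrative only, not any particular production MoE) shows tokens sorting themselves to whichever expert scores them highest, which is how one data type can end up concentrated in one expert:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model = 4, 8
gate = rng.normal(size=(d_model, n_experts))   # stand-in for learned router weights

def route(tokens):
    """Top-1 gating: each token is sent to its single highest-scoring expert."""
    return (tokens @ gate).argmax(axis=1)

tokens = rng.normal(size=(32, d_model))
counts = np.bincount(route(tokens), minlength=n_experts)
print(counts.sum())  # → 32 (every token lands on exactly one expert)
```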
I’m not sure how to square those results with the Chinchilla paper though
Apples and oranges. The Chinchilla paper simply optimizes the final trained model’s loss given a fixed compute budget. It doesn’t say anything about any downstream uses—similar to how it doesn’t tell you (directly) how you should allocate your compute if you have X GPUs and you want to run a model for your users for Y requests, and you have a tradeoff between spending your GPUs at training time to create a smaller model which needs fewer GPUs to serve Y requests. Likewise, you’ve probably seen some “overtraining” analyses which argue that you should overtrain a Chinchilla by some large amount Z to get the model which best balances train vs run—but those also answer a different question because they assume that you will deploy that Chinchilla model without any sparsification or lower precision, even though that’s hardly what anyone actually does.
(While no one has done Li et al. for MoEs that I know of, I would expect the results to be fairly similar, but shifted up/down, because you can often think of a MoE as a bunch of smaller dense models.)
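For concreteness, the compute-optimal recipe from the Chinchilla paper can be sketched with the usual approximations (training compute C ≈ 6·N·D FLOPs, and roughly 20 tokens per parameter at the optimum; both numbers are rules of thumb, and the budget below is just illustrative):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training budget C = 6*N*D into (N params, D tokens) with D = r*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# A ~5.8e23 FLOP budget lands near Chinchilla itself: ~70B params, ~1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
print(f"{n:.1e} params, {d:.1e} tokens")  # → 7.0e+10 params, 1.4e+12 tokens
```

Note this objective says nothing about inference or deployment costs, which is exactly the apples-and-oranges point above.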