I don’t believe that’s obvious, and to the extent that it’s true, I think it’s largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).
Man, that Li et al paper has pretty wild implications if it generalizes. I’m not sure how to square those results with the Chinchilla paper, though (I’m assuming it wasn’t something dumb like “wall-clock time was better with larger models because training was constrained by memory bandwidth, not compute”).
In any case, my point was more “I expect dumb throw-even-more-compute-at-it approaches like MoE, which can improve their performance quite a bit at the cost of requiring ever more storage space and ever-increasing inference costs, to outperform clever attempts to squeeze more performance out of single giant models”. If models just keep getting bigger while staying monolithic, I’d count that as pretty definitive evidence that my expectations were wrong.
Edit: For clarity, I specifically expect that MoE-flavored approaches will do better because, to a first approximation, sequence modelers learn heuristics in order of how much each one improves next-token prediction, and that value depends both on the strength of the pattern and on how frequently it comes up.
As a concrete example, the word “literally” occurs with a frequency of approximately 1/100,000. In about 1 in 6,000 of its occurrences, “literally” is followed by the word “crying”, while in about 1 in 40,000 it is followed by “sobbing”. If you just take those base rates at face value, then having seen the word “literally”, you should expect “crying” to be about 7x more likely than “sobbing” as the next word. One of the things a language model could learn, though, is that if your text looks like text from the early 1900s, that ratio should be more like 4:1, whereas if it looks like text from the mid 1900s, it should be more like 50:1. Learning the conditional effect of the year of authorship on the relative frequencies of those 2-grams improves overall model loss by about 3e-10 bits per word, if I’m calculating correctly (source: Google Ngrams).
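If you want to sanity-check that figure, here’s a rough sketch of the arithmetic. The even split between early-1900s-style and mid-1900s-style text is an assumption made purely to have something concrete to compute; the marginal rates are the ones quoted above:

```python
from math import log2

# Rough rates from the paragraph above.
P_LITERALLY = 1e-5        # frequency of "literally" in running text
P_CRYING    = 1 / 6_000   # P(next word = "crying"  | "literally"), marginal
P_SOBBING   = 1 / 40_000  # P(next word = "sobbing" | "literally"), marginal

# Assumption (mine): the corpus is an even mix of "early 1900s" and "mid 1900s"
# text, and the marginal rates above are the average of the two eras.  The
# quoted 4:1 and 50:1 crying:sobbing ratios then pin down the per-era rates.
RATIO_EARLY, RATIO_MID = 4.0, 50.0
s_mid   = (2 * P_CRYING - RATIO_EARLY * 2 * P_SOBBING) / (RATIO_MID - RATIO_EARLY)
s_early = 2 * P_SOBBING - s_mid
per_era = {
    "early": {"crying": RATIO_EARLY * s_early, "sobbing": s_early},
    "mid":   {"crying": RATIO_MID * s_mid,     "sobbing": s_mid},
}
marginal = {"crying": P_CRYING, "sobbing": P_SOBBING}

def kl_bits(p, q):
    # KL(p || q) in bits over {crying, sobbing, everything else after "literally"}
    p_other, q_other = 1 - sum(p.values()), 1 - sum(q.values())
    return sum(p[w] * log2(p[w] / q[w]) for w in p) + p_other * log2(p_other / q_other)

# Expected loss improvement from conditioning on era, in bits per word:
# (how often "literally" shows up) x (average gap between the era-aware
# prediction and the marginal prediction).
gain = P_LITERALLY * 0.5 * (kl_bits(per_era["early"], marginal) +
                            kl_bits(per_era["mid"], marginal))
print(f"~{gain:.1e} bits per word")   # ~2e-10 under these assumptions
```

Under that assumed mix it lands around 2e-10 bits per word, the same ballpark as the figure above; the exact value shifts with the assumed era proportions, but not by enough to change the point.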
If there’s some important fact about one specific unexpected nucleotide which occurs in half of mammalian genomes, but nucleotide sequence data is only 1% of your overall data and the other data you’re feeding the model includes text, your model will prefer to learn a gajillion little linguistic facts on the level of the above over learning this cool tidbit of information about genomes. Whereas if you separate out the models learning linguistic tidbits from the ones predicting nucleotide sequences, learning little linguistic tricks will trade off against learning other little linguistic tricks, and learning little genetics facts will trade off against learning other little genetics facts.
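To put toy numbers on that competition (only the 1% mixing fraction comes from the paragraph above; the per-token gains are made up purely for illustration):

```python
# Toy numbers, purely illustrative: how the data mixture discounts a
# domain-specific heuristic for a single shared sequence model.
genome_fraction      = 0.01    # nucleotide data is 1% of the training mix (from above)
genomics_fact_gain   = 1e-4    # bits/token the genomics fact saves *on genome data* (made up)
linguistic_fact_gain = 3e-10   # bits/word for one tiny linguistic fact (the "literally" example)

# Value of the genomics fact to the shared model vs. a dedicated genomics model:
shared_value    = genome_fraction * genomics_fact_gain  # bits/token of *overall* loss
dedicated_value = genomics_fact_gain                    # bits/token of genomics loss

# The shared model still values the genomics fact more than any *single* linguistic
# fact of the size above, but it is competing for capacity against a gajillion of
# them, and the 100x mixture discount is what tilts the portfolio toward trivia.
print(f"shared model discounts the genomics fact by {dedicated_value / shared_value:.0f}x")
```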
And if someone accidentally dumps a few database exports full of password hashes into the training dataset, then only one of your experts will decide that memorizing a few hundred million MD5 digests is the most valuable thing it could be doing, while the rest of your experts chip happily away at discovering marginal patterns in their own little domains.
I’m not sure how to square those results with the Chinchilla paper though
Apples and oranges. The Chinchilla paper simply optimizes the final trained model’s loss given a fixed compute budget. It doesn’t say anything about any downstream uses. Similarly, it doesn’t tell you (directly) how you should allocate your compute if you have X GPUs and need to serve your users for Y requests, where you face a tradeoff between spending more of your GPUs at training time to create a smaller model which needs fewer GPUs to serve those Y requests, and training a compute-optimal but larger model which is cheaper to train and more expensive to serve. Likewise, you’ve probably seen some “overtraining” analyses which argue that you should overtrain a Chinchilla-optimal model by some large amount Z to get the model which best balances training cost against serving cost; but those also answer a different question, because they assume you will deploy that model without any sparsification or reduced-precision inference, even though that’s hardly what anyone actually does.
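To make the “different questions” point concrete, here’s a minimal sketch using the Chinchilla parametric loss fit, collapsing “GPUs” into FLOPs for simplicity. The constants are the paper’s rough fits; the served-token count and the no-sparsification/no-quantization assumption are mine, purely for illustration:

```python
import numpy as np

# Chinchilla-style parametric loss, L(N, D) = E + A/N^a + B/D^b (Hoffmann et al. 2022),
# using the paper's rough fitted constants; treat them as illustrative, not gospel.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28
def loss(N, D):                       # N = parameters, D = training tokens
    return E + A / N**a + B / D**b

Ns = np.logspace(9, 12, 4000)         # candidate model sizes, 1B .. 1T params

# Question 1 (the one Chinchilla answers): for a fixed *training* budget
# C ~ 6*N*D FLOPs, which N minimizes final loss?
C = 1e23
D_train = C / (6 * Ns)
N_train_only = Ns[np.argmin(loss(Ns, D_train))]

# Question 2 (one it doesn't answer): the same FLOP budget must also cover
# *serving* T tokens at ~2*N FLOPs each.  Assumptions mine: T = 1e12 served
# tokens, dense model, no sparsification or reduced precision at inference.
T = 1e12
D_serve = (C - 2 * Ns * T) / (6 * Ns)
ok = D_serve > 0                      # drop model sizes too big to serve at all
N_train_and_serve = Ns[ok][np.argmin(loss(Ns[ok], D_serve[ok]))]

print(f"best N, training budget only : {N_train_only:.2e} params")
print(f"best N, training + serving   : {N_train_and_serve:.2e} params (smaller, i.e. overtrained)")
```

The second optimum comes out smaller than the first, and correspondingly trained on more tokens, which is the sense in which the “overtraining” analyses and the Chinchilla fit answer different questions rather than contradicting each other.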
(While no one has done a Li et al-style analysis for MoEs that I know of, I would expect the results to be fairly similar, just shifted up or down, because you can often think of a MoE as a bunch of smaller dense models.)