To me the relevant result/trend is that catastrophic forgetting seems to be less of an issue than it was maybe two to three years ago (e.g. in meta-learning), and that we can squeeze these diverse skills into a single model. Sure, the results seem to indicate that individual systems for different tasks would still be the way to go for now, but the published version was not trained with the same magnitude of compute that was used on, e.g., the latest and greatest LLMs (I take this from Lennart Heim, who did the math on this). So it is IMO hard to say with any certainty whether there are timeline-affecting surprises lurking if we either just trained longer or had faster hardware. I didn’t expect double descent and grokking, so my prior is that unexpected stuff happens.
On surprises:
I definitely agree that your timelines should take into account “maybe there will be a surprise”.
“There can be surprises” cuts both ways; you can also see e.g. a surprise slowdown of scaling results.
I also didn’t expect double descent and grokking, but it’s worth noting that, afaict, those have had ~zero effect on SOTA capabilities so far.
Regardless, the original question was about this particular result; this particular result was not surprising (given my very brief skim).
On catastrophic forgetting:
I agree that catastrophic forgetting is becoming less of an issue at larger scale but I already believed and expected that; it seemed like something like that had to be true for all of the previous big neural net results (OpenAI Five, AlphaStar, language models, etc) to be working as well as they were.
I was under the impression that basically all SOTA capabilities rely on double descent. Is that impression wrong?
… Where is that impression coming from? If this is a widespread view, I could just be wrong about it; I have a cached belief that large language models and probably other models aren’t trained to the interpolation threshold and so aren’t leveraging double descent.
I haven’t kept track of dataset size vs model size, but things I’ve read on the double descent phenomenon have generally described it as a unified model of the “classic statistics” paradigm where you need to deal with the bias-variance tradeoff, versus the “modern ML” paradigm where bigger=better.
I guess it may depend on the domain? Generative tasks like language modelling or image encoding implicitly end up having a lot more bits/sample than discriminative tasks, so maybe generative tasks are usually not in the second-descent regime while discriminative tasks usually are?
I like your point that “surprises cut both ways”, and I assume that this is why your timelines aren’t affected by the possibility of surprises; is that about right? I am confused about the ~zero effect, though: isn’t double descent basically what we see with giant language models lately? (Disclaimer: I don’t work on LLMs myself, so my confusion isn’t necessarily meaningful.)
My timelines are affected by the possibility of surprises; it makes them wider on both ends.
My impression is that giant language models are not trained to the interpolation point (though I haven’t been keeping up with the literature for the last year or so). I believe the graphs in that post were created specifically to demonstrate that if you did train them past the interpolation point, then you would see double descent.