I take the point of the paper as showing that as models get larger and more overparameterized, it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure. At some point, worrying about ‘classes’ or ‘heads’ just becomes irrelevant as you zero-shot or few-shot it: eg CLIP doesn’t really need to worry about catastrophic forgetting because you just type in the text description of what ‘class’ you’re interested in and ‘classify’ that way; a MoE doesn’t worry about task classification, because it learns what sub-expert to dispatch input to. You won’t need to ‘switch between tasks’ (not even that meaningful a thing outside the constraints of a benchmark) because in-context learning & representations do all the work, latently disambiguating where one is. You will simply train large (perhaps sparse or MoE-esque) models in one-epoch fashion, streaming in data constantly and discarding it. When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
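To make the ‘just type in the text description’ point concrete, here is a minimal zero-shot classification sketch, assuming the HuggingFace transformers CLIP wrapper (the model name, image path, and label strings are illustrative placeholders, not anything from the paper under discussion):

```python
# Minimal sketch: zero-shot 'classification' with CLIP by scoring an image against
# arbitrary text descriptions chosen at inference time. Assumes the HuggingFace
# transformers CLIP wrapper; model name, image path, and label strings are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_photo.jpg")  # any local image; the path is a placeholder
labels = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-text similarity -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```

Adding a ‘new class’ here is just appending another string to the label list; no weights are updated, so there is nothing to catastrophically forget.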
This solution will make a lot of people very unhappy as they insist that “this isn’t a solution, you just made a very large model, arrgghh, so inefficient and ungreen”, but if it solves the problem, then it solves the problem, and now you’re just haggling over the price. Are there more efficient ways? Almost certainly. Should we care that much about them, or bother researching them? Maybe.
I found the grokking paper, but it seems preliminary; can you link other such papers, or good spin-glass model papers, which illustrate this point? It will help me make a stronger claim there. Thanks for pointing this out!
The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.
I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful. There are some I host on gwern.net because they’re early examples of estimating NN scaling laws, but I didn’t get anything out of trying to read them. (Physicists gonna physics.) What I can link you right this second is a very cool interactive ‘explorable’ web page of many different spin-glasses, where you can see a lot of their behavior for yourself: http://bit-player.org/2021/three-months-in-monte-carlo (a minimal simulation sketch of the same kind of system follows after the footnote).
* I’m not sure if this is an example. Do student models ‘take off’ abruptly? In the graphs they look like they might idle around for scores or thousands of epochs and then take off, but it’s hard to tell from the graphs whether they are just truncated at an axis and actually show gradual, consistent improvement over many, many epochs.
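For anyone who would rather poke at the same physics in code than in the browser, here is a minimal Metropolis sketch of a 2D Edwards-Anderson spin glass (random ±1 couplings), roughly the kind of system that explorable simulates; the lattice size, temperature, and sweep count are arbitrary illustrative choices:

```python
# Minimal Metropolis sketch of a 2D Edwards-Anderson spin glass with random +/-1
# couplings, roughly the kind of system the bit-player explorable animates.
# Lattice size, temperature, and sweep count are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N = 64                                     # lattice side length
spins = rng.choice([-1, 1], size=(N, N))   # random initial spin configuration
# quenched random couplings to the right and downward neighbours (periodic boundaries)
J_right = rng.choice([-1.0, 1.0], size=(N, N))
J_down = rng.choice([-1.0, 1.0], size=(N, N))

def local_field(i, j):
    """Sum of coupling * neighbouring spin around site (i, j)."""
    return (J_right[i, j] * spins[i, (j + 1) % N]
            + J_right[i, (j - 1) % N] * spins[i, (j - 1) % N]
            + J_down[i, j] * spins[(i + 1) % N, j]
            + J_down[(i - 1) % N, j] * spins[(i - 1) % N, j])

def sweep(beta):
    """One Metropolis sweep: N*N attempted single-spin flips."""
    for _ in range(N * N):
        i, j = rng.integers(N), rng.integers(N)
        dE = 2.0 * spins[i, j] * local_field(i, j)   # energy change if spin (i, j) is flipped
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] = -spins[i, j]

for _ in range(200):
    sweep(beta=1.5)   # low temperature: the system freezes into one of many rugged local minima
```

Re-running it at different beta values shows the slow freezing into rugged local minima that the explorable lets you watch interactively.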
Using transparency tools to figure out what the model is ‘thinking’ and what strategy it is using: if it goes to the other side of the decision boundary and back, we can ask the model why it did that (suspecting that this was an early warning of deception starting).
I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
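As a concrete anchor for what ‘understanding an arbitrary SGD update’ would even have to beat, here is a minimal PyTorch sketch of the crude empirical proxy: take one optimizer step on a new batch and count which probe inputs change predicted class. The names (`model`, `new_batch`, `probe_inputs`) are hypothetical placeholders, not anything from a real transparency tool:

```python
# Minimal PyTorch sketch: measure which probe inputs cross the decision boundary
# after a single SGD update. `model`, `new_batch`, and `probe_inputs` are hypothetical.
import torch

def prediction_flips(model, loss_fn, optimizer, new_batch, probe_inputs):
    """Return how many probe inputs change predicted class after one update, and which ones."""
    with torch.no_grad():
        before = model(probe_inputs).argmax(dim=-1)

    x, y = new_batch
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()                      # one arbitrary SGD update

    with torch.no_grad():
        after = model(probe_inputs).argmax(dim=-1)

    flipped = before != after
    return int(flipped.sum()), flipped.nonzero(as_tuple=True)[0]
```

Of course this only samples a finite probe set; the complaint above is precisely that nothing like this scales to ‘all inputs’, and an extremely sharp boundary can hide between any two probes.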
This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about, and very surprising.
I’ve commented on that, but I’m not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomena, and we shouldn’t go around prematurely conflating them.
When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
This would be the main crux, actually a tremendously important crux. I take this to mean that models would largely be very far from an overparameterized regime relative to the data? I expect operating in an overparameterized regime to give a lot more capabilities, and I had considered overfitting the dataset to be almost a necessity, whereas you seem to indicate this is an unreasonable assumption to make?
If so, erm, not only catastrophic forgetting, but a lot of the stuff I’ve seen people on the AI Alignment Forum base their intuitions on, could potentially be thrown in the bin. E.g.: I’m more confident in catastrophic forgetting having its effect when the model is overfitted on the past data. If one cannot even properly learn the past data, but only the frequently occurring patterns in it, those patterns might occur too repetitively to be forgotten. But then, deep networks could do a lot better performance-wise by overfitting the dataset and exhaustively trying to remember the less-frequent patterns as well.
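For a rough sense of scale on the ‘overparameterized relative to the data’ question, here is a back-of-the-envelope comparison using GPT-3’s publicly reported figures (roughly 175B parameters and 300B training tokens, with the corpus seen on the order of once); the numbers below are just those reported figures, nothing measured here:

```python
# Back-of-the-envelope: parameters vs. tokens actually seen for a large LM trained
# in roughly one-epoch fashion. Figures are GPT-3's publicly reported ones and are
# only illustrative of the regime being discussed.
params = 175e9        # model parameters
tokens_seen = 300e9   # training tokens, each seen roughly once on average
print(f"tokens per parameter: {tokens_seen / params:.1f}")  # ~1.7
# Contrast with the classical overfitting picture, where a small fixed dataset is
# revisited for many epochs and the network can interpolate/memorize it.
```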
..it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure.
Here, the problem of catastrophic forgetting would not be in the downstream learning tasks; it would be in updating this learnt representation for newer tasks.
The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation. I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful.
Very fair, cool. Thanks, those five were nice illustrations, although I’ll need some time to digest the nature of non-linear dynamics. I’ve bookmarked it for an interesting trip someday.
I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
In this case, I deferred it as I don’t understand what’s really going on in transparency work.
But more generally speaking: ditto, I sort of believe this to a large degree. I was trying to highlight this point in the section ‘Application: Transparency’. I notice I’m significantly more pessimistic than the median person on the AI Alignment Forum, so there are some cruxes which I cannot put my finger on. Could you elaborate a bit more on your thoughts?
There’s a lot of physics-related nonlinearities/phase-transitions/powerlaw material in these workshop slides & videos, looks like: https://sites.google.com/mila.quebec/scaling-laws-workshop/schedule