Hi! Thanks for reading the post carefully and coming up with interesting evidence and arguments against~ I think I can explain PF4, but am certainly wrong on B1.
PF4
Why do you have high confidence that catastrophic forgetting is immune to scaling, given “Effect of scale on catastrophic forgetting in neural networks”, Anonymous 2021?
Catastrophic forgetting (mechanism): We train a model to minimize loss on dataset X. Then we train it to minimize loss on dataset Y. When minimizing loss on dataset Y, it has no incentive to care about loss on dataset X. Hence, catastrophic forgetting. This effect does not seem very solvable by simply scaling models.
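A minimal sketch of this mechanism, assuming a toy one-parameter linear model and two made-up, deliberately conflicting datasets (the names `dataset_x`, `dataset_y`, `train` and `mse` are illustrative, not from the post or the linked paper). With only one weight there is no spare capacity, so this is the worst case rather than a claim about large models:

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, lr=0.05, steps=200):
    """Plain gradient descent on the loss of `data` only.

    Nothing in this objective refers to any earlier dataset.
    """
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


xs = [-2.0, -1.0, 1.0, 2.0]
dataset_x = [(x, 2.0 * x) for x in xs]   # hypothetical task X: y = 2x
dataset_y = [(x, -3.0 * x) for x in xs]  # hypothetical task Y: y = -3x

# Phase 1: minimize loss on X only.
w_after_x = train(0.0, dataset_x)
loss_x_before = mse(w_after_x, dataset_x)  # close to 0

# Phase 2: minimize loss on Y only, starting from the phase-1 weight.
w_after_y = train(w_after_x, dataset_y)
loss_x_after = mse(w_after_y, dataset_x)   # about 62.5 = 25 * mean(x^2)

print(f"loss on X after training on X: {loss_x_before:.6f}")
print(f"loss on X after training on Y: {loss_x_after:.6f}")
```

The second phase drives the weight to whatever fits Y, and the loss on X rises simply because it never appears in the objective being minimized.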
Re: linked paper: I did not know about it while writing the post. On a very preliminary skim, I don’t think their modeling paradigm, multi-head, is practically even relevant. (Let me know if my skim was a misread):
Why? A simple baseline: After training on every task, save the model! At inference time you’re given which task the example is from, so use that task’s saved model for inference. No forgetting!
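The baseline can be sketched as follows, again assuming a toy one-parameter model and two made-up tasks (the `checkpoints` dict and `predict_with_task_id` helper are hypothetical names for illustration). The point is only that a known task identity at inference time lets a lookup table of saved models sidestep forgetting:

```python
import copy


def train(model, data, lr=0.05, steps=200):
    """Gradient descent on `data` only; updates the model in place."""
    for _ in range(steps):
        w = model["w"]
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        model["w"] = w - lr * grad
    return model


def predict(model, x):
    return model["w"] * x


xs = [-2.0, -1.0, 1.0, 2.0]
tasks = {
    "X": [(x, 2.0 * x) for x in xs],   # hypothetical task X: y = 2x
    "Y": [(x, -3.0 * x) for x in xs],  # hypothetical task Y: y = -3x
}

model = {"w": 0.0}
checkpoints = {}
for task_id, data in tasks.items():
    train(model, data)                           # sequential training
    checkpoints[task_id] = copy.deepcopy(model)  # save a snapshot per task


def predict_with_task_id(task_id, x):
    """Use the snapshot saved right after training on `task_id`."""
    return predict(checkpoints[task_id], x)


print(predict_with_task_id("X", 1.0))  # uses the task-X snapshot
print(predict(model, 1.0))             # final model only fits task Y
```

The cost is one stored model per task, and it relies entirely on being told the task at inference time.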
Strong Catastrophic Forgetting: Let’s say we’ve trained a model on Imagenet10k. New data comes, and we have ~4k classes arriving, over an unknown number of timesteps (we cannot assume this is known). A realistic lifelong learning case would be hundreds of timesteps or more. The question is: how do we learn this new information given time constraints, since we need the new model deployed (which translates to compute, given fixed resources)?
Here: (a) I’m skeptical that scaling will adequately address this (the degree of drop >> difference between scaled models); (b) the compute constraint is what explicitly works against larger models. But I feel the catastrophic forgetting dynamics here would be very different from those in the presented work. E.g., we cannot use the above baseline and train different models for this, as we would have the hard task of identifying which model to use (near-best case: Supermasks in Superposition).
Weak Catastrophic Forgetting: I think the ‘cannot store data’ assumption is weird given storage is virtually free (compared to compute required to train a good model on that kind of data). So it’s the same problem—maintain better performance with a time constraint with full access to past data (here the problem itself is weaker but the compute constraints will be far tighter).
This is roughly my intuition. What do you think?
B1: Overall, I’ve updated to state what I’ve written in B1 is not true.
Re: First: After we’ve discussed this, I shall relegate it to a not-true section at the end of the post (so it’s still visible) and add grokking as a surprising bias (intuition also explained here) in Generalization properties. I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~
Re: Second: Yeah, I’m assuming we train large networks and then think about this problem.
Re: Third: When defining sudden, non-linear shifts in NNs, I think defining them somewhere before the decision (say, on the probability distribution over actions/decisions) would be a much stronger and more useful claim to make. So a good claim would be that an SGD update might cause us to go from ‘0.9% exterminate all humans’ to ‘51% exterminate all humans’, if true (seems likely).
Re: Fourth: I think conditional generation being different given different conditions is qualitatively different from, and less interesting than, suddenly updating to a large degree (grokking).
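To put a rough number on the 0.9% to 51% example in the third point, here is a small sketch. It assumes the decision is a single binary output passed through a sigmoid, which is an illustrative simplification and not something the post specifies:

```python
import math


def logit(p):
    """Inverse of the sigmoid: the pre-activation that yields probability p."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


p_before, p_after = 0.009, 0.51

# How far the pre-sigmoid value has to move for the probability to
# cross the 50% decision boundary from 0.9%.
shift = logit(p_after) - logit(p_before)

print(f"logit before: {logit(p_before):.3f}")
print(f"logit after:  {logit(p_after):.3f}")
print(f"required shift: {shift:.3f}")
```

Under this assumption the required shift is a little under 5 logit units, so the claim amounts to saying that a single update can move the pre-decision quantity by that much.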
The claim by Evan: The context was identifying trigger points of deception in SGD—Using transparency tools to figure out what the model is ‘thinking’ and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting). Now, (a) I think the boundary switch will always be small even in this case, even when viewed via a transparency tool (b) In any case, this is me over-generalizing and being obviously wrong. Correcting this will make the article stronger.
I take the point of the paper as showing that as models get larger and more overparameterized, it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure. At some point, worrying about ‘classes’ or ‘heads’ just becomes irrelevant as you zero-shot or few-shot it: eg CLIP doesn’t really need to worry about catastrophic forgetting because you just type in the text description of what ‘class’ you’re interested in and ‘classify’ that way; a MoE doesn’t worry about task classification, because it learns what sub-expert to dispatch input to. You won’t need to ‘switch between tasks’ (not even that meaningful a thing outside the constraints of a benchmark) because in-context learning & representations do all the work, latently disambiguating where one is. You will simply train large (perhaps sparse or MoE-esque) models in one-epoch fashion, streaming in data constantly and discarding it. When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
This solution will make a lot of people very unhappy as they insist that “this isn’t a solution, you just made a very large model, arrgghh, so inefficient and ungreen”, but if it solves the problem, then it solves the problem, and now you’re just haggling over the price. Are there ways more efficient? Almost certainly. Should we care that much about or bother researching them? Maybe.
I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~
The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.
I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful. There’s some I host on gwern.net because they’re early examples of estimating NN scaling laws but I didn’t get anything out of trying to read them. (Physicists gonna physics.) What I can link you right this second is a very cool interactive ‘explorable’ web page of many different spin-glasses, where you can see a lot of their behavior for yourself: http://bit-player.org/2021/three-months-in-monte-carlo
* I’m not sure if this is an example. Do student models ‘take off’ abruptly? They look in the graphs like they might idle around for scores or thousands of epochs and then take off, but it’s hard to tell from the graphs whether they are just truncated at an axis and actually show gradual consistent improvement over many many epochs.
Using transparency tools to figure out what the model is ‘thinking’ and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting).
I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very surprising.
I’ve commented on that, but I’m not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomena, and we shouldn’t go around prematurely conflating them.
When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
This would be the main crux, actually a tremendously important crux. I take this to mean that models would largely be very far from an overparameterized regime relative to the data? I expect operating in an overparameterized regime to give a lot more capabilities, and currently consider overfitting to the dataset as almost a need, whereas you seem to indicate this is an unreasonable assumption to make?
If so, erm, not only catastrophic forgetting but a lot of the stuff I’ve seen people on the AI alignment forum base their intuitions on could potentially be thrown in the bin. E.g.: I’m more confident in catastrophic forgetting having its effect when the model is overfitted on the past data. If one cannot even properly learn past data but only frequently occurring patterns from it, those patterns might recur too often to forget. But then, deep networks could do a lot better performance-wise by overfitting the dataset and exhaustively trying to remember the less-frequent patterns as well.
..it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure.
Here, the problem of catastrophic forgetting would not be on downstream learning tasks, it would be on updating this learnt representation to newer tasks.
The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation. I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful.
Very fair, cool. Thanks, those five were nice illustrations, although I’ll need some time to digest the nature of non-linear dynamics. I’ve bookmarked it for an interesting trip someday.
I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
In this case, I deferred it as I don’t understand what’s really going on in transparency work.
But more generally speaking: Ditto, I sort of believe this to a large degree. I was trying to highlight this point in Section ‘Application: Transparency’. I notice I’m significantly more pessimistic than the median person on the AI alignment forum, so there are some cruxes which I cannot put my finger on. Could you elaborate a bit more on your thoughts?
There’s a lot of physics-related nonlinearities/phase-transitions/powerlaw material in these workshop slides & videos, looks like: https://sites.google.com/mila.quebec/scaling-laws-workshop/schedule