Hi,
I am Ameya Prabhu, a DPhil candidate at University of Oxford at Torr Vision Group. I work on Continual Learning, geared towards designing ML systems which have a notion of knowledge (memory).
This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very surprising.
They’re likely to be interchangeable, sorry. Here I might’ve misused the words to try to tease out the distinction: simply understanding how a given model works is not really insightful if the patterns it uses are not understandable.
I think these patterns, which seem nonsensical to humans, might be a significant fraction of the patterns learned by deep networks. I was trying to understand the radical optimism, in contrast to my pessimism given this. The crux is that since we don’t know what these patterns are and what they represent, even if we figure out which neurons detect them and which tasks they contribute most to, we might not be able to do the downstream tasks we require transparency for, like diagnosing possible issues and providing solutions.
Am I correct in thinking that ‘ersatz’ and ‘real’ interpretability might differ in more than just the degree of interpretability? Ersatz is somewhat embedded in explaining the typical case, whereas ‘real interpretability’ gives good reasoning even in the worst case. Interpretability might be hard to achieve in worst-case scenarios where some atypical wiring leads to wrong decisions?
Furthermore, I suspect we are confusing transparency with interpretability. Even if we understand what each and every neuron does (radical transparency), the model might not be interpretable if what the neurons do seems like gibberish.
If it seems correct so far, I elaborate on these here.
When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
This would be the main crux, actually a tremendously important crux. I take this to mean that models would largely be very far from an overparameterized regime relative to the data? I expect operating in an overparameterized regime to give a lot more capabilities, and I currently consider overfitting to the dataset as almost a necessity, whereas you seem to indicate this is an unreasonable assumption to make?
If so, erm, not just catastrophic forgetting, but a lot of the stuff I’ve seen people on the AI Alignment Forum base their intuitions on could potentially be thrown in the bin. E.g., I’m more confident in catastrophic forgetting having its effect when the model is overfitted on the past data. If one cannot even properly learn the past data but only the frequently occurring patterns from it, those patterns might recur too often to be forgotten. But then, deep networks could do a lot better performance-wise by overfitting the dataset and exhaustively trying to remember the less frequent patterns as well.
..it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure.
Here, the problem of catastrophic forgetting would not be on downstream learning tasks, it would be on updating this learnt representation to newer tasks.
The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.
I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful.
Very fair, cool. Thanks, those five were nice illustrations, although I’ll need some time to digest the nature of non-linear dynamics. I’ve bookmarked it for an interesting trip someday.
I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
In this case, I deferred it as I don’t understand what’s really going on in transparency work.
But more generally speaking: Ditto, I sort of believe this to a large degree. I was trying to highlight this point in Section ‘Application: Transparency’. I notice I’m significantly more pessimistic than the median person on the AI Alignment Forum, so there are some cruxes which I cannot put my finger on. Could you elaborate a bit more on your thoughts?
Hi! Thanks for reading and for the interesting questions:
I read that first sentence several times and it’s still not clear what you mean, or how the footnote helps clarify. What do you mean by ‘tweak’? A tweak is a small incremental change.
That’s correct. What I meant is: say we state that an agent has ‘x, y, z biases’; it can then try to correct them. Now, the changes cannot be arbitrary: the constraints are that it has to stay competitive and robust. But I think it can reduce the strength of a heuristic by going against it whenever it can, to the extent that those heuristics would have little usefulness. But it’s likely that I have a wrong and weird conception of superintelligence here.
Your section on adversarial examples will not hold up well—that is a bet I am fairly confident on.
Huh. I would find that very surprising; it should hold up to sparsity. Let me clarify my reasoning, and then can you say what you think I might be missing? Or why you still think so?
Proper sparse regularized internal weights and activations—which compress and thus filter out noise—can provide the same level of defense against adversarial perturbations that biological cortical vision/sensing provides.
Notice that this noise is average case, whereas adversarial examples are worst case. This difference might be doing a lot of heavy lifting here. Conventional deep networks have really nice noise stability properties, as in, they are able to filter out injected noise to a good extent, as illustrated in Stronger generalization bounds for deep nets via a compression approach (ICML ’18). In the worst case, even with 3D vision, the narrow focus and other biases/limitations of human vision give rise to a wide variety of adversarial examples. Some examples: the ‘the the’ reading problem, not noticing large objects crossing in a video, and falling prey to a good variety of illusions. I’m not sure human vision is a good example of a robust sensing pipeline.
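To make the average-case vs worst-case distinction concrete, here is a minimal sketch on a toy linear scorer (not a deep network, and not taken from any of the cited papers). The dimension `d`, budget `eps`, and helper names are all hypothetical choices for illustration. It compares how much a random perturbation and an adversarially chosen one, both with the same L2 norm, move the score.

```python
import math
import random

random.seed(0)
d = 1000          # input dimension (arbitrary choice for this sketch)
eps = 1.0         # L2 budget shared by both kinds of perturbation

# Toy linear scorer: score(x) = w . x, with random weights.
w = [random.gauss(0.0, 1.0) for _ in range(d)]
w_norm = math.sqrt(sum(v * v for v in w))

def score_shift(delta):
    """Change in the linear score w . x caused by adding delta to x."""
    return sum(wi * di for wi, di in zip(w, delta))

def random_perturbation():
    """Gaussian direction rescaled to L2 norm eps (the average case)."""
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = math.sqrt(sum(v * v for v in g))
    return [eps * v / g_norm for v in g]

# Average case: mean absolute shift over many random directions.
trials = 200
avg_shift = sum(abs(score_shift(random_perturbation()))
                for _ in range(trials)) / trials

# Worst case: push directly against w, which by Cauchy-Schwarz
# maximizes the shift achievable within the same L2 budget.
worst_delta = [-eps * v / w_norm for v in w]
worst_shift = abs(score_shift(worst_delta))

print(f"average-case shift: {avg_shift:.3f}")
print(f"worst-case shift:   {worst_shift:.3f}")
```

In this linear toy the gap between the two grows roughly like the square root of the input dimension, so a model can be very stable to injected noise and still be moved a lot by a perturbation of the same size that is chosen adversarially.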
I know this based on my own internal theory and experiments rather than a specific paper, but just a quick search on the literature reveals theoretical & experimental support 1,2,3,4
Err, I find citing literature to be often insufficient, especially in ML, to meet a reasonable bar of support (a surprising number of papers accepted at top conferences are unfortunately lame). I usually have to carefully read and analyze them.
For these papers, after quickly reading some of them, my comment is as follows. Notice that often: (a) they do not test with adaptive attacks (b) the degree of robustness they provide to weaker attacks is minimal (c) any simple defense like Gaussian smoothing will do a lot better. Hence, they provide little support for the robustness claim.
For comparison, good empirical results in compression-gives-robustness look like: Robustness via Deep Low-Rank Representations (Arxiv). Even this is insufficient as a defense, as adaptive attacks are likely to break it. Reference for adaptive attacks: On Adaptive Attacks to Adversarial Example Defenses (NeurIPS ’20), one of my favourite works in this area.
Now, let me put forward the case against: Our current understanding of sparsity (works in G5, AR3) is that sparsity allows us to reduce parameterization, but only to a certain extent. Effects in AR3 suggest we probably need more complex models, and not simpler ones (with sparsity/regularization), for robustness. That is, the direction you think we should go in seems to be the opposite of what the literature suggests (and this is indeed counter-intuitive!).
I think the reason people (including me) have been pessimistic about this direction and switched to doing research in other things is that it doesn’t seem to give many benefits except a certain reduction in memory/parameterization at the extra cost of code modifications.
Adversarial examples are an artifact of the particular historical trajectory that DL took on GPUs where there is no performance advantage to sparsity … sparsity regularization (over activations especially) is a rather expensive luxury on current GPU software/hardware
I don’t think this is true, at least not to a large degree. GPUs do take a lot of advantage of sparse patterns, or maybe I have a lower bar for ‘sparsity works!’ than you do. PyTorch cuts memory use and speeds up computation by a huge amount if you use sparse tensors! If you have structured sparsity (blocksparse structure, pointwise convolutions), it’s even better, and there are some very fast CUDA kernels to leverage that.
It has limited upside, not contributing interesting/helpful inductive biases. It’s fairly common to sparsify and quantize deep networks in deployment phase, although often the non-sparse CUDA kernels work fine as they’re insanely optimized.
Hi! Thanks for reading the post carefully and coming up with interesting evidence and arguments against~ I think I can explain PF4, but am certainly wrong on B1.
PF4
Why do you have high confidence that catastrophic forgetting is immune to scaling, given “Effect of scale on catastrophic forgetting in neural networks”, Anonymous 2021?
Catastrophic forgetting (mechanism): We train a model to minimize loss on dataset X. Then we train it to minimize loss on dataset Y. When minimizing loss on dataset Y, it has no incentive to care about loss on dataset X. Hence, catastrophic forgetting. This effect does not seem very solvable by simply scaling models.
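A minimal sketch of this mechanism, assuming a one-parameter linear model and two deliberately conflicting toy tasks (the datasets, learning rate, and function names are all hypothetical, chosen only for illustration). The model is trained on X, then on Y, and the loss on X is measured before and after.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.05, steps=200):
    """Plain gradient descent on the MSE of `data` only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

inputs = (-2.0, -1.0, 1.0, 2.0)
task_x = [(x, 2.0 * x) for x in inputs]    # dataset X: target slope 2
task_y = [(x, -1.0 * x) for x in inputs]   # dataset Y: target slope -1

w = train(0.0, task_x)
loss_x_before = mse(w, task_x)   # close to zero: X has been learned

w = train(w, task_y)             # nothing in this objective mentions X
loss_x_after = mse(w, task_x)    # large: X has been overwritten

print(f"loss on X after training on X: {loss_x_before:.6f}")
print(f"loss on X after training on Y: {loss_x_after:.6f}")
```

The toy exaggerates the conflict between tasks, but it shows the point of the argument: the second training phase optimizes only the loss on Y, so whatever it costs on X is never seen by the optimizer.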
Re: linked paper: I did not know about it while writing the post. On a very preliminary skim, I don’t think their modeling paradigm, multi-head, is practically even relevant. (Let me know if my skim was a misread):
Why? A simple baseline: After training on every task, save the model! At inference time, since you’re told which task the example is from, use that task’s model for inference. No forgetting!
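This baseline can be sketched as follows, assuming the task identity is supplied at inference time, as in the multi-head setting. The one-parameter model, the `models` store, and the task names are hypothetical stand-ins; a real version would save full model checkpoints.

```python
def fit_slope(data):
    """Closed-form least squares for the one-parameter model y = w * x."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

models = {}  # task id -> saved model (here just a slope)

def learn_task(task_id, data):
    """Train on one task and save the result under its task id."""
    models[task_id] = fit_slope(data)

def predict(task_id, x):
    """Use the model saved for the task the example is known to come from."""
    return models[task_id] * x

inputs = (-2.0, -1.0, 1.0, 2.0)
learn_task("X", [(x, 2.0 * x) for x in inputs])
learn_task("Y", [(x, -1.0 * x) for x in inputs])  # does not touch the model for X

print(predict("X", 3.0), predict("Y", 3.0))
```

Storage grows linearly with the number of tasks, and the whole scheme depends on being handed the task id, which is why this setting says little about the harder case where the task has to be inferred.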
Strong Catastrophic Forgetting: Let’s say we’ve trained a model on Imagenet10k. New data comes, and we have ~4k classes arriving, for an unknown #timesteps (cannot assume this is known). A realistic lifelong learning model case would be >hundreds of timesteps. The question is how do we learn this new information, given time constraints as we need the new model deployed (translating to compute given fixed resources)?
Here: (a) I’m skeptical that scaling will adequately address this (the degree of drop >> difference between scaled models) (b) The compute constraint is what explicitly works against larger models. But I feel the catastrophic forgetting dynamics here would be very different from those in the presented work.
E.g., we cannot use the above baseline and train different models for this, as we will have the hard task of identifying which model to use (near-best case: Supermasks in Superposition).
Weak Catastrophic Forgetting: I think the ‘cannot store data’ assumption is weird given storage is virtually free (compared to compute required to train a good model on that kind of data). So it’s the same problem—maintain better performance with a time constraint with full access to past data (here the problem itself is weaker but the compute constraints will be far tighter).
This is roughly my intuition. What do you think?
B1: Overall, I’ve updated to state what I’ve written in B1 is not true.
Re: First - What I shall do is after we’ve discussed, I shall relegate it to not-true section at the end of the post (so it’s still visible) and add grokking as a surprising bias (intuition also explained here) in Generalization properties. I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~
Re: Second - Yeah, I’m assuming we train large networks and then think about this problem.
Re: Third - While defining sudden, non-linear shifts in NNs I think somewhere before the decision (say probability distribution over actions/decisions) would be a much stronger and useful claim to make. So, a good claim would be saying that an SGD update might cause us to go from ‘0.9% exterminate all humans’ to ‘51% exterminate all humans’ if true (seems likely).
Re: Fourth - I think conditional generation being different given different conditions is qualitatively different and less interesting than suddenly updating to a large degree (grokking).
The claim by Evan: The context was identifying trigger points of deception in SGD. Using transparency tools to figure out what the model is ‘thinking’ and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why it did that (suspecting that this was a warning sign of deception starting). Now, (a) I think the boundary switch will always be small even in this case, even when viewed via a transparency tool (b) In any case, this is me over-generalizing and being obviously wrong. Correcting this will make the article stronger.
Referencing recent papers sent my way here (this shall be a live, expanding comment), please do link more if you think they might be useful:
- Inductive biases in theory-based reinforcement learning