gwern comments on Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

gwern 16 Dec 2021 23:27 UTC
LW: 10 AF: 8
Why do you have high confidence that catastrophic forgetting is immune to scaling, given “Effect of scale on catastrophic forgetting in neural networks”, Anonymous 2021?

My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an SGD update.

I also disagree with B1, gradual change bias. This seems to me to be either obviously wrong, or limited to safety/capability-irrelevant definitions like “change in KL of overall distribution of output from small vanilla supervised-learning NNs”, and definitely does not generalize to even small amounts of confidence in assertions like “SGD updates based on small n cannot meaningfully change a very powerful NN’s behavior in a real-world-affecting way”.

First, it is pervasive in physics and the world in general that small effects can have large outcomes. No snowflake thinks it’s responsible for the avalanche, but one was. Physics models often have chaotic regimes, and models like the Ising spin glass model (one of the most popular theoretical models for neural net analysis for several decades) are notorious for how tiny changes can cause total phase shifts and regime changes. NNs themselves are often analyzed as being on the ‘edge of chaos’ in various ways (the exploding gradient problem, except actual explosions). Breezy throwaway claims about ‘oh, NNs are just smooth and don’t change much on one update’ are so much hot air against this universal prior.
- As an example, consider the notorious fragility of NNs to (hyper)parameters: one set will fail completely, wildly diverging. Another, similar set, will suddenly work. Edge of chaos. Or consider the spikiness of model capabilities over scaling and how certain behaviors seem to abruptly emerge at certain points (eg ‘grokking’, or just the sudden phase transitions on benchmarks where small models are flatlined but then at a threshold, the model suddenly ‘gets it’). This parallels similar abrupt transitions in human psychology, like Piagetian levels. In grokking, the breakthrough appears to happen almost instantaneously; for humans and large model capabilities, we lack detailed benchmarking which would let us say that “oh, GPT-3 understands logic at exactly iteration #501,333”, but it should definitely make you think about assumptions like “it must take countless SGD iterations for each of these capabilities to emerge.” (This all makes sense if you think of large NNs as searching over complexity-penalized ensembles of programs, and at some point switching from memorization-intensive programs to the true generalizing program; see my scaling hypothesis writeup & jcannell’s writings.)
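To make the hyperparameter-fragility point concrete, here is a toy sketch of my own (not from any of the linked papers): plain gradient descent on a 1-D quadratic. The function `run_gd`, the curvature, and the two learning rates are all illustrative assumptions. Each step multiplies x by (1 - lr * curvature), so the run converges when that factor has magnitude below 1 and diverges when it is above 1.

```python
def run_gd(lr, curvature=1.0, x0=1.0, steps=200):
    """Gradient descent on f(x) = 0.5 * curvature * x**2; returns the final |x|."""
    x = x0
    for _ in range(steps):
        grad = curvature * x   # derivative of 0.5 * curvature * x**2
        x = x - lr * grad      # each step scales x by (1 - lr * curvature)
    return abs(x)

# Two nearby learning rates on either side of the stability threshold lr = 2 / curvature.
stable = run_gd(lr=1.95)    # per-step factor -0.95: shrinks toward 0
unstable = run_gd(lr=2.05)  # per-step factor -1.05: grows without bound

print(f"lr=1.95 -> final |x| = {stable:.3e}")
print(f"lr=2.05 -> final |x| = {unstable:.3e}")
```

A real NN loss surface is nothing like a single quadratic, so this only shows that a sharp works/diverges threshold can exist even in the simplest possible setting.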
Second, the pervasive existence of adversarial examples should lead to extreme doubt on any claims of NN ‘smoothness’. Absolutely and perceptually tiny shifts in image inputs lead to almost arbitrarily large wacky changes in output distribution. These may be unnatural, but they exist. If tweaking a pixel here and there by a quantum can totally change the output from ‘cat’ to ‘dog’, why can’t an SGD update, which can change every single parameter, have similar effects? (Indeed, you link the isoperimetry paper, which claims that this is logically necessary for all NNs which are too small for their problem, where for ImageNet I believe they ballpark it as “even the largest contemporary ImageNet models are still 2 OOMs too small to begin to be robust”.)
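A minimal sketch of why tiny per-pixel changes can flip an output, using a hypothetical high-dimensional linear classifier rather than a real image model. The weights `w`, the input `x`, and the 'cat'/'dog' sign convention are all made up for illustration. The point is only that a per-coordinate nudge of size eps moves the score by eps times the L1 norm of the weights, which grows with dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000                               # stand-in for a flattened image
w = rng.choice([-1.0, 1.0], size=dim)      # hypothetical classifier weights
x = rng.uniform(0.0, 1.0, size=dim)        # hypothetical input, "pixels" in [0, 1]

def score(v):
    """Linear score: positive means 'cat', negative means 'dog' (arbitrary labels)."""
    return float(w @ v)

original = score(x)

# Nudge every coordinate by eps against the current label. Because each weight
# is +/-1, the score moves by eps * dim, so this eps pushes it just past zero.
eps = (abs(original) + 1.0) / dim
x_adv = x - np.sign(original) * eps * w

print(f"original score:   {original:+.3f}")
print(f"perturbed score:  {score(x_adv):+.3f}")
print(f"per-pixel change: {eps:.5f}")
```

The perturbed input is not clipped back into [0, 1], and real adversarial attacks on deep nets work through the gradient of a nonlinear model, so treat this as an intuition pump and not a reproduction of any published attack.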

Third, small changes in outputs can mean large changes in behavior. Actions and choices are inherently discrete where small changes in the latent beliefs can have arbitrarily large behavior changes (which is why we are always trying to move away from actions/choices towards smoother easier relaxations where we can pretend everything is small and continuous). Imagine a NN estimating Q-values for taking 2 actions, “Exterminate all humans” and “usher in the New Jerusalem”, which normalize to 0.4999999 and 0.5000001 respectively. It chooses the argmax, and acts friendly. Do you believe that there is no SGD update which after adjusting all of the parameters, might reverse the ranking? Why, exactly?
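The two-action example can be sketched in a few lines. The logits below are chosen by hand so that they normalize to roughly 0.4999999 and 0.5000001, and `logits_after` is a hypothetical stand-in for whatever a single update might do. Nothing here is taken from a real model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

actions = ["exterminate all humans", "usher in the New Jerusalem"]

# Hand-picked logits that normalize to about (0.4999999, 0.5000001).
logits_before = np.array([0.0, 4e-7])
# Hypothetical effect of one update: a shift of 1e-6 in a single logit.
logits_after = logits_before + np.array([1e-6, 0.0])

p_before = softmax(logits_before)
p_after = softmax(logits_after)

choice_before = actions[int(np.argmax(p_before))]
choice_after = actions[int(np.argmax(p_after))]
max_prob_shift = float(np.max(np.abs(p_after - p_before)))

print("before:", p_before, "->", choice_before)
print("after: ", p_after, "->", choice_after)
print("largest change in any probability:", max_prob_shift)
```

Any smoothness measure defined on the output distribution would call this update negligible, while the action taken is reversed.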

Fourth, NNs are often designed to have large changes based on small changes, particularly in meta-learning or meta-reinforcement-learning. In prompt programming or other few-shot scenarios, we radically modify the behavior of a model with potentially trillions of parameters by merely typing in a few words. In neural meta-backdoors/data poisoning, there are extreme shifts in output for specific prespecified inputs (sort of the inverse of adversarial examples), which are concerning in part because that’s where a gradient-hacker could store arbitrary behavior (so a backdoored NN is a little like Light in Death Note: he has forgotten everything & acts perfectly innocent… until he touches the right piece of paper and ‘wakes up’). In cases like MAML, the second-order training is literally designed to create a NN at a saddle point where a very small first-order update will produce very different behavior for each potential new problem; like a large boulder perched at the tip of a mountain which will roll off in any direction at the slightest nudge. In meta-reinforcement-learning, like RNNs being trained to solve a t-maze which periodically flips the reward function, the very definition of success is rapid total changes in behavior based on few observations, implying the RNN has found a point where the different attractors are balanced and the observation history can push it towards the right one easily. These NNs are possible, they do exist, and given the implicit meta-learning we see emerge in larger models, they may become increasingly common and exist in places the user does not expect.

So, I see loads of reasons we should worry about an advanced AI system maneuvering from one strategy to another after a single update, both in general priors and based on what we observe of past NNs, and good reason to believe that scaling greatly increases the dangers there. (Indeed, the update need not even be of the ‘parameters’, updates to the hidden state / internal activations are quite sufficient.)
- bayesian_kitten 17 Dec 2021 2:36 UTC
  3 points
  Parent
  Hi! Thanks for reading the post carefully and coming up with interesting evidence and arguments against~ I think I can explain PF4, but am certainly wrong on B1.
  
  PF4
  Why do you have high confidence that catastrophic forgetting is immune to scaling, given “Effect of scale on catastrophic forgetting in neural networks”, Anonymous 2021?
  Catastrophic forgetting (mechanism): We train a model to minimize loss on dataset X. Then we train it to minimize loss on dataset Y. When minimizing loss on dataset Y, it has no incentive to care about loss on dataset X. Hence, catastrophic forgetting. This effect does not seem very solvable by simply scaling models.
  Re: linked paper: I did not know about it while writing the post. On a very preliminary skim, I don’t think their modeling paradigm, multi-head, is even practically relevant. (Let me know if my skim was a misread):
  Why? A simple baseline: After training on every task, save the model! At inference time, you’re given which task the example is from, use that model for inference. No forgetting!
  Strong Catastrophic Forgetting: Let’s say we’ve trained a model on Imagenet10k. New data comes, and we have ~4k classes arriving, for an unknown #timesteps (cannot assume this is known). A realistic lifelong learning model case would be >hundreds of timesteps. The question is how do we learn this new information, given time constraints as we need the new model deployed (translating to compute given fixed resources)?
  
  Here: (a) I’m skeptical that scaling will adequately address this (the degree of drop >> difference between scaled models); (b) the compute constraint is what explicitly works against larger models. But I feel the catastrophic forgetting dynamics here would be very different from those in the presented work.
  E.g., we cannot use the above baseline and train different models for this, as we will have the hard task of identifying which model to use (near-best case: Supermasks in Superposition).
  
  Weak Catastrophic Forgetting: I think the ‘cannot store data’ assumption is weird given storage is virtually free (compared to compute required to train a good model on that kind of data). So it’s the same problem—maintain better performance with a time constraint with full access to past data (here the problem itself is weaker but the compute constraints will be far tighter).
  This is roughly my intuition. What do you think?
  
  B1: Overall, I’ve updated to state what I’ve written in B1 is not true.
  
  Re: First. What I shall do, after we’ve discussed, is relegate it to the not-true section at the end of the post (so it’s still visible) and add grokking as a surprising bias (intuition also explained here) in Generalization properties. I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~
  
  Re: Second. Yeah, I’m assuming we train large networks and then think about this problem.
  
  Re: Third. While defining sudden, non-linear shifts in NNs I think somewhere before the decision (say probability distribution over actions/decisions) would be a much stronger and useful claim to make. So, a good claim would be saying that an SGD update might cause us to go from ‘0.9% exterminate all humans’ to ‘51% exterminate all humans’ if true (seems likely).
  
  Re: Fourth. I think conditional generation being different given different conditions is qualitatively different and less interesting than suddenly updating to a large degree (grokking).
  The claim by Evan: The context was identifying trigger points of deception in SGD—Using transparency tools to figure out what the model is ‘thinking’ and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting). Now, (a) I think the boundary switch will always be small even in this case, even when viewed via a transparency tool (b) In any case, this is me over-generalizing and being obviously wrong. Correcting this will make the article stronger.
  - gwern 17 Dec 2021 22:52 UTC
    4 points
    Parent
    I take the point of the paper as showing that as models get larger and more overparameterized, it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure. At some point, worrying about ‘classes’ or ‘heads’ just becomes irrelevant as you zero-shot or few-shot it: eg CLIP doesn’t really need to worry about catastrophic forgetting because you just type in the text description of what ‘class’ you’re interested in and ‘classify’ that way; a MoE doesn’t worry about task classification, because it learns what sub-expert to dispatch input to. You won’t need to ‘switch between tasks’ (not even that meaningful a thing outside the constraints of a benchmark) because in-context learning & representations do all the work, latently disambiguating where one is. You will simply train large (perhaps sparse or MoE-esque) models in one-epoch fashion, streaming in data constantly and discarding it. When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
    
    This solution will make a lot of people very unhappy as they insist that “this isn’t a solution, you just made a very large model, arrgghh, so inefficient and ungreen”, but if it solves the problem, then it solves the problem, and now you’re just haggling over the price. Are there ways more efficient? Almost certainly. Should we care that much about or bother researching them? Maybe.
    
    I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~
    
    The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.
    
    I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful. There’s some I host on gwern.net because they’re early examples of estimating NN scaling laws but I didn’t get anything out of trying to read them. (Physicists gonna physics.) What I can link you right this second is a very cool interactive ‘explorable’ web page of many different spin-glasses, where you can see a lot of their behavior for yourself: http://bit-player.org/2021/three-months-in-monte-carlo
    
    * I’m not sure if this is an example. Do student models ‘take off’ abruptly? They look in the graphs like they might idle around for scores or thousands of epochs and then take off, but it’s hard to tell from the graphs whether they are just truncated at an axis and actually show gradual consistent improvement over many many epochs.
    
    Using transparency tools to figure out what the model is ‘thinking’ and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting).
    
    I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
    - gwern 19 Dec 2021 19:55 UTC
      4 points
      Parent
      There’s a lot of physics-related nonlinearities/phase-transitions/powerlaw material in these workshop slides & videos, looks like: https://sites.google.com/mila.quebec/scaling-laws-workshop/schedule
      - bayesian_kitten 23 Dec 2021 12:35 UTC
        1 point
        Parent
        This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very surprising.
        - gwern 23 Dec 2021 17:32 UTC
          3 points
          Parent
          I’ve commented on that, but I’m not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomena, and we shouldn’t go around prematurely conflating them.
    - bayesian_kitten 18 Dec 2021 13:25 UTC
      2 points
      Parent
      When you have enough real-world data, you don’t need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It’s worth noting that no one in the large language model space has ever ‘used up’ all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don’t have to keep around the original dataset to sample maintenance batches from while doing more training.
      This would be the main crux, actually a tremendously important crux. I take this to mean that models would largely be very far from an overparameterized regime relative to the data? I expect operating in an overparameterized regime to give a lot more capabilities, and currently consider overfitting to the dataset as almost a necessity, whereas you seem to indicate this is an unreasonable assumption to make?
      
      If so, erm, not just catastrophic forgetting, but a lot of the stuff I’ve seen people on the AI Alignment Forum base their intuitions on could potentially be thrown in the bin. E.g.: I’m more confident in catastrophic forgetting having its effect when the model is overfitted on the past data. If one cannot even properly learn past data but only frequently occurring patterns from it, those patterns might recur too often to be forgotten. But then, deep networks could do a lot better performance-wise by overfitting the dataset and exhaustively trying to remember the less-frequent patterns as well.
      ..it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure.
      Here, the problem of catastrophic forgetting would not be on downstream learning tasks, it would be on updating this learnt representation to newer tasks.
      The grokking paper is definitely preliminary. No one expected that and I’m not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.
      I don’t have a list of spin-glass papers because I distrust such math/theory papers and haven’t found them to be helpful.
      Very fair, cool. Thanks, those five were nice illustrations, although I’ll need some time to digest the nature of non-linear dynamics. I’ve bookmarked it for an interesting trip someday.
      I’m not sure how useful transparency tools would be. They can’t tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can’t find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.
      In this case, I deferred it as I don’t understand what’s really going on in transparency work.
      
      But more generally speaking: Ditto, I sort of believe this to a large degree. I was trying to highlight this point in Section ‘Application: Transparency’. I notice I’m significantly more pessimistic than the median person on the AI Alignment Forum, so there are some cruxes which I cannot put my finger on. Could you elaborate a bit more on your thoughts?