There is an extensive discussion about feature learning in relation to the aforementioned Mingard et al. result in the comments of this post. The conclusion of the discussion was that feature learning is uncoupled from inductive bias for infinite-width (and actually finite-width, with further conditions) neural networks when trained by a random-sampling process (essentially how NNGPs work).
The open question is whether the probability distributions over functions after each layer are the same whether you train with SGD or random sampling. Given how similar the posteriors of optimiser-trained NNs are to those of NNGPs, I think it is sensible to assume that they are similar. However, the important question is still whether this scales to large architectures and datasets, which become computationally much harder to test (as the NNGP kernel becomes harder and harder to compute as the dataset grows).
Chris Mingard
Good guess ;)
Haha some things are pretty obvious—it’s always really nice to get a very different perspective on an idea, thank you for continuing the conversation!
I see—so you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite randomly-sampled nets
That is exactly what I’m saying. I don’t know if it is testable in practice, but it is in theory … I would be very interested to see anything about this—let me know if you find anything!
If it turns out that, in the limit of infinite width, feature learning does not work, what are your thoughts about my case for feature learning for the narrow (but trained-by-random-sampling) case? I would guess you find this case significantly more compelling than the infinite width case?
By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I’m wrong] that all three methods should have the same inductive bias as well.
Not exactly the same—it is known that there is a width dependence on inductive biases. I believe that typically wide networks are better, although I know of some counterexamples.
They’re clearly different in some respects -- (C) can do transfer learning but (A) cannot
I think this is the main source of our disagreement. First of all, while the posterior of an NNGP is equivalent to that of a trained-by-random-sampling infinitely wide NN, it does not contain all the same information. It is a collapsed version of an infinitely wide neural network that does not contain any information about the weights in each layer. This was one of Greg Yang’s points—by definition, a kernel method cannot learn features as you are ignoring the effects of the initial layers, as from a function perspective they are irrelevant—in other words, you have just thrown that information away.
This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features: there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.
(B is unclear).
Assuming that the network is so compressed that it can barely represent the true function without substantial fine-tuning of weights in all layers, weights in early layers would absolutely have to be very different from random initialisation.
However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don’t really have a great understanding of how feature learning in SGD works.
You can make arguments that this is what would happen for very wide networks—but then SGD is probably doing the same thing, unless you’re assuming that it learns a few (e.g.) car detector neurons and then the rest are completely redundant. I would expect the car detector neurons to show up in narrower networks, but by my point immediately above, I don’t see why this has to be an SGD-only property.
My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly.
Yes but again an NNGP has thrown away all information about the weights. The NTK limit effectively passes all the gradient to the last layer, so again, by definition, it is a linear model.
Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do can only come from finiteness. In contrast, in the case of SGD, it’s possible to do feature learning even in the infinite-width limit.
Same point as above. The Greg Yang paper shows you need to do the clever reparameterisation to make sure not all the gradient gets passed to the last layer (as it does in NTK). The NNGP flattens the neural network so again there can be no feature learning by that representation. So I think the conclusion “can only come from finiteness” is wrong. The second point is correct, but only because you haven’t collapsed the network into a kernel. If you were to take an extremely wide neural network and train the whole thing by random sampling with some extra steps (e.g. encouraging orthogonality of intermediate outputs between different classes), I don’t see why you wouldn’t have some degree of ‘feature learning’ here.
Perhaps this is a physicist vs mathematician type of thinking though. I think I see where you are coming from, but I don’t think the no feature learning arguments are valid, as I think I outlined.
[Advance apologies if I haven’t explained stuff well enough here. I think the important theme here is that we should maintain a way of thinking about the random sampling picture that is distinct from NNGPs.]
Right, this is an even better argument that NNGPs/random-sampled nets don’t learn features.
Ah I see I need to explain myself further—the following is very counterintuitive but I think it’s right. Learning features involves the movement of weights in the early layers, by definition. The claim I am making is that the reason why feature learning is good is not because it improves inductive bias—it is because it allows the network to be compressed. That is probably at the core of our disagreement.
Imagine taking a network and making it so thin that it is only just able to represent the function it needs to. Now try the training with last layer only after randomly initialising the others. You can’t—because the randomly initialised first layers will drastically decrease its expressivity, so you can’t express the true function. Now do the same, but with very wide layers—by the lottery ticket hypothesis (and in the limit of infinite width) this will work well because of the (near) unlimited expressivity. Hence, for narrow networks you have to “learn features” to make sure you are expressive enough, but for wide ones you do not.
Consider ResNet18 on Imagenet. Imagenet has an input dimension of $3\times256\times256\approx200000$. The widths of the layers within the resnet are at least two orders of magnitude smaller (so you are nowhere near the limit of infinite width). This is the case of the thin network I talked about earlier—you have to learn features precisely because the network isn’t expressive enough for you to get away without. I’m pretty sure this is the motivation for making networks deep in the first place—for expressivity reasons.
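The arithmetic behind that comparison can be checked in a few lines. This is only a sketch: it assumes "width" is read as the number of channels in each ResNet18 stage (64, 128, 256, 512), which is one possible reading, and the variable names are illustrative.

```python
# Input dimension quoted above: 3 colour channels of 256 x 256 pixels.
channels, height, width = 3, 256, 256
input_dim = channels * height * width

# Assumption: "width" means the channel count of each ResNet18 stage.
resnet18_stage_channels = [64, 128, 256, 512]

# How many times larger the input dimension is than each stage width.
ratios = [input_dim / c for c in resnet18_stage_channels]

print(f"input dimension: {input_dim}")
for c, r in zip(resnet18_stage_channels, ratios):
    print(f"{c:4d} channels -> input dimension is {r:.0f}x larger")
```

Under this reading, even the widest stage is a few hundred times smaller than the input dimension.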
So my claim is features are important to keep the number of parameters small, but do not in themselves aid inductive bias. I know that the first pushback to this will be “but transfer learning improves inductive bias.” Of course—you are basically taking a network that has just been trained on millions of images, and then using this on a set of new images—there will be some information in common across images that has been encoded in the earlier layers. The hierarchical nature of neural networks allows this to happen, but fundamentally not in a way that could not be explained by the random sampling picture.
So, in conclusion, I don’t think SGD needs to be doing any “feature learning” beyond what can be achieved in the random sampling fashion. Note that the random sampling arguments apply not only in the limit of infinite width.
[However, it is worth noting that this is conjecture, although I think it is the most natural conclusion from what we know about DNNs. That said, I will only be happy to accept it when we have found a good way of rigorously comparing the posteriors of a random-sample trained finite width neural network and its corresponding SGD trained version]
I think this would probably generalize worse than the network with an actual ‘car detector’ (this isn’t empirical evidence of course, but I think what we know about SGD-trained nets and the NNGP strongly suggests a picture like this)
What do we know about SGD-trained nets that suggests this?
Not to be a broken record, but I strongly recommend checking out Greg Yang’s work. He clearly shows that there exist infinite-width limits of SGD that can do feature/transfer learning.
I’ve read the new feature learning paper! We’re big fans of his work, although again I don’t think it contradicts anything I’ve just said.
I 100% agree that Kolmogorov complexity is not the best measure of complexity here, and I would refer anyone to your and Joar’s comments at https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of for an excellent discussion of this. I am aware that Kolmogorov complexity is defined wrt a UTM. I should have clarified in the blog that a lot of steps were used to make the link between Kolmogorov complexity and these types of input-output maps, and stated that we only talk about Kolmogorov complexity because of the Levin bound (somewhat repurposed for input-output maps), which interestingly appears to capture the relationship between probabilities of functions and their complexities quite accurately for several different complexity measures.
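For reference, the repurposed bound is usually written in roughly the following form. This is a sketch of the general shape rather than a quotation from the papers: the symbols $a$, $b$ and $\tilde{K}$ are assumed notation, with $\tilde{K}(f)$ standing for some computable approximation to the complexity of $f$ and $a > 0$, $b$ constants that depend on the map but not on $f$.

$$P(f) \le 2^{-a\tilde{K}(f) + b}$$

Read this way, the bound only says that high-probability functions must be simple; it does not say that every simple function has high probability.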
[First, thank you for your comments and observations; it’s always interesting to read pushback]
First, I think my point about using the GP to measure the volume occupied by functions locally to where SGD-trained networks are initialised is important. We are not really comparing NNs to NNGPs (well, technically we are, but we are interpreting what the NNGP does differently). We are trying to argue that SGD acts as a random sampler: it will find functions with probability proportional to the volume of those functions local to where the optimiser is in parameter-space. We argue that this quantity is well approximated by the NNGP.
This is relevant to the comments on features: if you look at the definition of $P_B(f|S)$ it’s fairly clear that (assuming training by random sampling) initialising and freezing all but the last layer and then random sampling over that will, in expectation, give precisely the same posterior distribution as if you were to random sample over the whole network. This property holds for finite and infinite width networks. This may seem counterintuitive, but the term $P(S|f)$ in the definition of $P_B(f|S)$ ensures that if the random initialisation of the frozen layers does not allow for 100% training accuracy, that random initialisation is ignored. Therefore, if an optimiser samples functions in proportion to their volume, you won’t get any difference in performance whether you learn features (optimise the whole network) or do not learn features (randomly initialise and freeze all but the last layer and then train just the last).
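A toy illustration of what "training by random sampling" means for $P_B(f|S)$ may help here. It is only a sketch under stated assumptions: the definition is read as Bayes' rule with a 0-1 likelihood, $P_B(f|S) = P(S|f)P(f)/P(S)$, where $P(S|f)=1$ if $f$ fits the training set $S$ and $0$ otherwise. The tiny ReLU network, the Gaussian prior, the majority-vote target and the names `sample_function`, `fits_training_set` and `TRAIN_IDX` are hypothetical choices for illustration, not taken from the paper.

```python
import itertools
import random
from collections import Counter

random.seed(0)

N_IN, HIDDEN = 3, 8
INPUTS = list(itertools.product([0, 1], repeat=N_IN))  # all 8 Boolean inputs
TARGET = [int(sum(x) >= 2) for x in INPUTS]            # hypothetical target: majority vote
TRAIN_IDX = [0, 3, 5, 7]                               # hypothetical training set S


def sample_function():
    """Draw every weight from a Gaussian prior; return the Boolean function computed."""
    w1 = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(HIDDEN)]
    b1 = [random.gauss(0, 1) for _ in range(HIDDEN)]
    w2 = [random.gauss(0, 1) for _ in range(HIDDEN)]
    b2 = random.gauss(0, 1)
    outputs = []
    for x in INPUTS:
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(w1, b1)]
        outputs.append(int(sum(w * h for w, h in zip(w2, hidden)) + b2 > 0))
    return tuple(outputs)


def fits_training_set(f):
    """0-1 likelihood P(S|f): True only if f matches every training label."""
    return all(f[i] == TARGET[i] for i in TRAIN_IDX)


# Rejection sampling: keep only parameter draws whose function fits S.
accepted = Counter()
for _ in range(5000):
    f = sample_function()
    if fits_training_set(f):
        accepted[f] += 1

total = sum(accepted.values())
posterior = {f: n / total for f, n in accepted.items()}

# Show the most probable functions under the estimated posterior.
for f, p in sorted(posterior.items(), key=lambda kv: -kv[1])[:5]:
    print("".join(map(str, f)), f"{p:.3f}")
```

Draws that cannot fit $S$ are simply discarded, which is the role the 0-1 likelihood plays. One way to probe the frozen-layer claim empirically would be a variant of `sample_function` that redraws only `w2, b2` around a fixed draw of `w1, b1`, averaged over many such fixed draws.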
Given therefore that the posteriors are the same, it implies that feature learning is not aiding inductive bias—rather, feature learning is important for expressivity reasons. The reason why you can’t just use frozen initial layers and obtain the same inductive bias on SOTA architectures is most likely because you can’t make the layers wide enough, to ensure that the network is expressive enough with high probability. Imagenet for example has input dimension of ~200000 so you would need some very wide layers to approach the wide-layer limit.
Furthermore (and on a slightly different note), it is known that infinitesimal GD converges to the Boltzmann distribution for any DNN (very similar to random sampling) https://arxiv.org/abs/2004.01190. This means that the coloured noise in SGD is the only possible source for drastically improved inductive bias (which would have to emerge only on large scales, as we do not observe it at smaller scales). I have also not heard as good a theoretical justification for why this noise would dramatically aid generalisation.
Given this, I think it a sensible null hypothesis that optimisers are approximately performing random sampling from a well-biased parameter-space (with some subtleties, see my other comment about tempered posteriors), at substantially larger scales. This to me makes more sense than “optimisers perform random sampling at small/medium scales, but as you move to bigger scales coloured noise in SGD is the dominant source of inductive bias”.
Finally, I would like to point out that this is my impression from the literature, and my work. I am aware that there’s a lot I don’t know, and if anyone can point out why this line of argument is not correct, or can steelman a case for SGD inductive bias appearing at larger scales, I would be very interested to hear it.
Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias—in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior—which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn’t say anything about why a function like 010101010101… appears more frequently than other functions in the maximum-entropy class.
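The bias towards mostly-0 or mostly-1 functions is easy to see empirically in a tiny case. The following is a sketch under assumptions not taken from the paper: a bias-free perceptron $f(x) = 1[w \cdot x > 0]$ on $\{0,1\}^4$, i.i.d. standard Gaussian weights, and illustrative names such as `ones_count`. It compares the distribution of the number of 1s in the output with what a uniform prior over functions would give.

```python
import itertools
import math
import random
from collections import Counter

random.seed(0)

N = 4  # input dimension, kept tiny so all 2^N inputs can be enumerated
INPUTS = list(itertools.product([0, 1], repeat=N))
SAMPLES = 20000


def ones_count(weights):
    """Number of inputs that the bias-free perceptron 1[w.x > 0] maps to 1."""
    return sum(1 for x in INPUTS
               if sum(w * xi for w, xi in zip(weights, x)) > 0)


counts = Counter(
    ones_count([random.gauss(0, 1) for _ in range(N)]) for _ in range(SAMPLES)
)
empirical = {t: counts[t] / SAMPLES for t in range(len(INPUTS))}

# Uniform prior over functions with f(0) = 0 (the origin is always mapped to 0
# without a bias term): the number of 1s is Binomial(2^N - 1, 1/2).
m = len(INPUTS) - 1
uniform = {t: math.comb(m, t) / 2 ** m for t in range(m + 1)}

for t in range(m + 1):
    print(f"{t:2d} ones: perceptron {empirical[t]:.4f}   uniform {uniform[t]:.6f}")
```

In this setting the all-zero function appears exactly when every weight is non-positive, which has probability $2^{-4}$, whereas a uniform prior over these functions gives it probability $2^{-15}$.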
I agree that “large volume --> simple” is what is shown by the evidence in the papers, as opposed to “simple --> large volume”, which is in fact a claim we do not make anywhere (if we do accidentally, please let me know and I will fix it). See https://arxiv.org/abs/1910.00971 for more detail on this, or Joar Skalse’s comments on https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of, where he discusses functions which don’t obey this rule, such as the identity function, which has small volume and is very simple. If optimisers find functions approximately proportional to their volume in parameter-space, this would be a good explanation for why neural networks struggle to learn identity functions. (In fact, theoretical reasons exist which suggest that such low-probability low-complexity functions should exist, but should be rare https://arxiv.org/abs/1910.00971).
Also, very briefly on your comment on feature learning—the GP limit is used to calculate the volume of functions locally to the initialisation. The fact that kernel methods do not learn features should not be relevant given this interpretation—although there are some interesting corollaries of this—and it is something we are investigating.
I think a lot of the points you raise here have good answers at https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of (see in particular replies by Joar Skalse, the author of that post). You say that you don’t think it surprising that the posteriors of NNs are similar to NNGPs on the data on which they were trained to fit; I think this statement is only unsurprising if you assume that SGD is not playing a particularly big role in the inductive bias (for small/medium scale datasets and architectures). In the main paper https://jmlr.org/papers/v22/20-676.html we do review a substantial amount of literature on the topic. Some results that rely on “different hyperparameters result in different generalisation” type arguments were found later to be due to different effective training times (see Hoffer et al 2017). We also show that optimiser hyperparameter tuning can affect the generalisation, although in a fashion similar to changing the temperature in fully tempered posteriors (see eqn 1 in https://openreview.net/pdf?id=cu6zDHCfhZx); in other words, still fundamentally due to the architecture.
Beyond the pretty conclusive evidence that SGD is a much smaller source of inductive bias than the architecture on small/medium scale tasks, I think there is a lot of evidence elsewhere that the architecture is responsible for the first-order generalisation capabilities of the network. For example, https://arxiv.org/abs/2012.04115 shows that architecture-only bounds are excellent predictors of performance on SOTA networks (e.g. wide resnets), as does https://arxiv.org/pdf/2002.02561.pdf (from a different group). For more circumstantial evidence, it is known that CNNs typically outperform fully connected nets for image classification, and transformers outperform LSTMs for sentiment analysis etc., even though the same type of optimisers are used.
I think there are very interesting questions remaining about the role of the optimiser in narrow networks, feature learning and very large scale models. Clearly though, the methods we used on the small/medium scale architectures and datasets will not scale to these questions without some major changes. For the meantime, we are using current methods to investigate some edge cases, none of which have yet shown strong deviation from our predictions. I would suggest that the architecture being the main source of inductive bias might be a sensible null hypothesis for the cases we are yet to directly probe. I also think that the comparative simplicity of the hypothesis is quite compelling: SGD finds functions with probability proportional to their volume in parameter space/performs random sampling (very closely when there is strong bias in the parameter-function map and progressively less closely the weaker it gets), and a strong architectural bias towards simplicity (again with some subtleties) causes the good generalisation.
AGD can train any architecture, dataset and batch size combination (as far as we have tested), out-of-the-box. I would argue that this is a qualitative change to the current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination, in order to converge in an optimal or near-optimal time. I think this is a reasonable interpretation of “train ImageNet without hyperparameters”. That said, there is a stronger sense of “hyperparameter-free” where the optimum batch size and architecture size would decide on the compute-optimal scaling. And, an even stronger sense where the architecture type is selected.
In other words, we have the following hierarchy of lack of hyperparameterness:
1. learning rate must be selected, sometimes with schedulers etc. or via heuristics, to guarantee convergence for any architecture, dataset, batch size …
2. pick an architecture, dataset and batch size and it will converge (hopefully) in a near-optimal time
3. compute-optimal batch size and architecture size is automatically found for a dataset
4. given a dataset, we are given the best architecture type (e.g. resnet, CNN etc.)
I would argue that we currently are in stage 1. If AGD (or similar optimisers) do actually work like we think, we’re now in stage 2. In my mind, this is a qualitative change.
So, I think calling it “another learning-rate tuner” is a little disingenuous: incorporating information about the architecture seems to move in a direction of eliminating a hyperparameter by removing a degree of freedom, rather than a “learning rate tuner”, which I think of as a heuristic method usually involving trial-and-error, without any explanation for why that learning rate is best. However, if there are similar papers out there already that you think do something similar, or you think I’m wrong in any way, please send them over, or let me know!