In a Bayesian learning process, the relevant singularity becomes progressively simpler with more data. In general, learning processes involve trading off a more accurate fit against “regularizing” singularities. Based on Figure 7.6 in [1].
Q: What's going on here? Are you claiming that you get better generalization if you have a large complexity gap between the local singularities you start out with and the local singularities you end up with?

A: The claim behind Figure 7.6 in Watanabe is more conjectural than much of the rest of the book, but the basic point is that adding new samples changes the geometry of your loss landscape. (K_n(w) is different for each n.) As you add more samples, the free-energy-minimizing tradeoff starts favoring a more accurate fit and a smaller singularity. This would lead to progressively more complex functions, which seems to match observations for SGD.
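To make the tradeoff concrete, here is a toy numerical sketch (my own illustration, not from [1]; the losses and λ values are made up) using the leading-order free energy expansion F_n ≈ nL + λ log n, applied heuristically to two competing regions of the loss landscape:

```python
import numpy as np

# Toy illustration of the free-energy tradeoff. Heuristic: a region with
# training loss L and learning coefficient lam contributes roughly
#   F(n) ~ n * L + lam * log(n)
# to the Bayesian free energy (leading-order expansion, cf. [1]).
def free_energy(n, L, lam):
    return n * L + lam * np.log(n)

# Region A: more singular (small lam, heavily "regularizing") but a worse fit.
# Region B: milder singularity (larger lam) but a better fit.
L_A, lam_A = 0.50, 1.0
L_B, lam_B = 0.45, 5.0

for n in [10, 100, 1000, 10_000]:
    f_a, f_b = free_energy(n, L_A, lam_A), free_energy(n, L_B, lam_B)
    print(f"n={n:>6}: F_A={f_a:8.1f}  F_B={f_b:8.1f}  "
          f"posterior prefers {'A' if f_a < f_b else 'B'}")
```

With these made-up numbers the posterior prefers the more singular region A at small n and switches to the better-fitting region B around n ≈ 500, which is the kind of phase transition the figure is gesturing at.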
Q: But ReLU networks are not analytic.

A: Idk man, seems unimportant: you can approximate ReLU arbitrarily well with analytic activations anyway.
Q: But smoothness is nice.

A: Smoothness is nice, but hey, we use swishes anyway.
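For context: swish is x · sigmoid(βx), which is analytic everywhere and approaches ReLU pointwise as β grows. A minimal sketch (the numbers are just for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): analytic everywhere, unlike ReLU,
    # and it approaches relu(x) pointwise as beta -> infinity.
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-3.0, 3.0, 601)
print(np.max(np.abs(swish(x, beta=10.0) - relu(x))))  # ~0.03
```

So in practice you can have an analytic network that is as close to the ReLU one as you like.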
There’s speculation that we might be able to transfer the machinery of the renormalization group, a set of techniques and ideas developed in physics to deal with critical phenomena and scaling, to understand phase transitions in learning machines, and ultimately to compute the scaling coefficients from first principles.
Q: I thought the original scaling laws paper was based on techniques from statistical mechanics? Anyway, that does sound exciting. Do you know if anyone has a plausible model for the Chinchilla scaling laws? Also, I'd like to see if anyone has tried predicting scaling laws for systems with active learning.

A: The scaling analysis borrows from the empirical side. In terms of predicting the actual coefficients behind these curves, we're still mostly in the dark (there are some ideas). I may have given the sense that this scaling-laws program is further along than it actually is. As far as I know, we're not there yet with Chinchilla, active learning, etc.
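To spell out what "borrowing from the empirical side" looks like: you measure losses at several scales and fit the coefficients rather than deriving them. A minimal sketch on synthetic data (the power-law-plus-constant form is the standard ansatz; every number below is illustrative, not from any real run):

```python
import numpy as np
from scipy.optimize import curve_fit

# Standard empirical ansatz: loss as a function of scale N.
def power_law(N, a, b, c):
    return a * N ** (-b) + c

# Synthetic "measurements" standing in for real training runs.
rng = np.random.default_rng(0)
N = np.geomspace(1e6, 1e9, 8)
true_params = (20.0, 0.3, 0.5)  # illustrative, not real coefficients
loss = power_law(N, *true_params) * (1 + 0.005 * rng.standard_normal(N.size))

# The exponent b is fitted from data; nothing here predicts it a priori.
(a, b, c), _ = curve_fit(power_law, N, loss, p0=(1.0, 0.5, 1.0))
print(f"fitted exponent b = {b:.3f} (data generated with b = {true_params[1]})")
```

Computing the exponent from first principles, instead of fitting it, is exactly the part that is still open.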