What’s going on here? Are you claiming that you get better generalization if you have a large complexity gap between the local singularities you start out with and the local singularities you end up with?
The claim behind figure 7.6 in Watanabe is more conjectural than much of the rest of the book, but the basic point is that adding new samples changes the geometry of your loss landscape. (K_n(w) is different for each n.) As you add more samples, the free-energy-minimizing tradeoff shifts toward favoring a more accurate fit over a smaller singularity. This would lead to progressively more complex functions (which seems to match observations for SGD).
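To make that tradeoff concrete, here’s a toy sketch (my own illustration, not from the book), using the leading-order free energy F_n ≈ n·L(w) + λ·log n and two hypothetical phases: a very singular, poorly fitting one and a less singular, better-fitting one. All the numbers are made up; the point is just that the preferred phase flips as n grows, because the n·L term eventually dominates the λ·log n term.

```python
import numpy as np

# Toy illustration of the SLT free-energy tradeoff. The two "phases"
# and their numbers are hypothetical, not from Watanabe.
def free_energy(n, loss, rlct):
    # Leading-order asymptotics: F_n ~ n * L(w) + rlct * log(n),
    # where rlct is the real log canonical threshold (lambda).
    return n * loss + rlct * np.log(n)

for n in [10, 100, 1_000, 10_000]:
    f_simple = free_energy(n, loss=0.50, rlct=1.0)   # very singular, poor fit
    f_complex = free_energy(n, loss=0.45, rlct=5.0)  # less singular, better fit
    winner = "simple" if f_simple < f_complex else "complex"
    print(f"n={n:>6}: simple={f_simple:9.1f}  complex={f_complex:9.1f}  -> {winner}")
```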
But smoothness is nice.
Smoothness is nice, but hey, we use swish activations anyway.
I thought the original scaling laws paper was based on techniques from statistical mechanics? Anyway, that does sound exciting. Do you know if anyone has a plausible model for the Chinchilla scaling laws? Also, I’d like to see if anyone has tried predicting scaling laws for systems with active learning.
The scaling analysis there is borrowed more from the empirical side than from statistical mechanics. In terms of predicting the actual coefficients behind these curves, we’re still in the dark. Well, mostly. (There are some ideas.)
I may have given the sense that this scaling-laws program is farther along than it actually is. As far as I know, we’re not there yet with Chinchilla, active learning, etc.
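For what it’s worth, the Chinchilla paper (Hoffmann et al., 2022) fits a parametric form L(N, D) = E + A/N^α + B/D^β to its training runs, where N is parameter count and D is token count; the coefficients come out of curve fitting rather than theory, which is the gap I mean. Here’s a minimal sketch of such a fit on synthetic data (the measurements and starting guesses are hypothetical; the ground-truth values are the paper’s rough fitted coefficients):

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta.
# The functional form is an empirical ansatz; the coefficients are fit to data.
def chinchilla_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

rng = np.random.default_rng(0)
N = 10.0 ** rng.uniform(7, 10, size=200)   # hypothetical model sizes
D = 10.0 ** rng.uniform(9, 12, size=200)   # hypothetical token counts
true = dict(E=1.69, A=406.0, alpha=0.34, B=411.0, beta=0.28)  # rough values from the paper
L = chinchilla_loss((N, D), **true) + rng.normal(0.0, 0.01, size=200)

fit, _ = curve_fit(chinchilla_loss, (N, D), L, p0=[1.0, 100.0, 0.3, 100.0, 0.3])
print(dict(zip(["E", "A", "alpha", "B", "beta"], np.round(fit, 3))))
```

Recovering the coefficients from clean synthetic data is easy; predicting them from first principles is the open part.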