But one thing this model likely predicts is that a better model for a NN than a single linear regression is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; the prediction is then that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
Loving this!
Epic.
A couple of things come to mind.
Linear features = sufficient statistics of exponential families?
The simplest case is that of Gaussians and their covariance matrices, which comes down to linear regression (a small numerical sketch is below).
This is formalized by the GPD theorem.
See also the generalization by John.
Exponential families are a fairly good class, but they are not closed under hierarchical structure. The basic example: a mixture of Gaussians is not an exponential family, i.e. it cannot be described in terms of just linear regression.
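A minimal numerical sketch of the Gaussian case (assuming nothing beyond numpy): the sufficient statistics are the first and second moments, and the conditional mean they determine is exactly the OLS regression line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Jointly Gaussian data: y = 2x + noise.
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)

# Sufficient statistics of the Gaussian are the first and second moments.
cov_xy = np.cov(x, y)[0, 1]
var_x = np.var(x, ddof=1)

# The conditional mean E[y|x] is linear, with slope Cov(x, y) / Var(x),
# which is exactly the OLS regression coefficient.
slope_from_suff_stats = cov_xy / var_x
slope_from_ols = np.polyfit(x, y, deg=1)[0]
print(slope_from_suff_stats, slope_from_ols)  # both approximately 2.0
```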
The centrality of ReLU neural networks.
Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
There are several lines of evidence that gradient descent has an inherent bias towards splines/piecewise-linear functions/tropical polynomials; see e.g. here and references therein.
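To illustrate the piecewise-linear structure itself (a numpy sketch, with a random network standing in for a trained one): a ReLU MLP on a 1-D input is linear everywhere except at finitely many kinks, which shows up as second finite differences vanishing off a small set of grid points.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 2-layer ReLU MLP on a 1-D input (stand-in for a trained net).
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def mlp(x):
    h = np.maximum(W1 @ x[None, :] + b1[:, None], 0.0)  # ReLU hidden layer
    return (W2 @ h + b2[:, None]).ravel()

# On a fine grid, second finite differences vanish except near the kinks:
# the function is piecewise linear, with at most one kink per hidden unit
# (each kink can touch up to two consecutive second differences).
xs = np.linspace(-3.0, 3.0, 2001)
curvature = np.abs(np.diff(mlp(xs), n=2))
print("grid points with nonzero curvature:", int(np.sum(curvature > 1e-8)))
print("hidden units (bound on number of kinks):", 8)
```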
Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here. You say:
“very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat “shallow” point of view on learning, but probably captures a nontrivial part of what’s going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you’re defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it’s a good model for “local learning”, i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes “Barack” with “Obama”).”
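The quoted locality claim can be checked directly in the max-of-affines model (a minimal numpy sketch, not from the paper): shifting one affine piece changes the function only where that piece was at, or within the shift of, the max.

```python
import numpy as np

rng = np.random.default_rng(1)

# A convex function built as a max of affine pieces: f(x) = max_k (a_k x + b_k).
a, b = rng.normal(size=5), rng.normal(size=5)

def f(x, b):
    return np.max(a[:, None] * x[None, :] + b[:, None], axis=0)

xs = np.linspace(-5.0, 5.0, 1001)
before = f(xs, b)

# "Weight update": shift a single affine piece up by 0.1.
b_new = b.copy()
b_new[2] += 0.1
after = f(xs, b_new)

# The update is felt only where piece 2 was within 0.1 of the max,
# i.e. where it was (nearly) dominant; everywhere else f is unchanged.
changed = np.abs(after - before) > 1e-12
near_dominant = (a[2] * xs + b[2]) > before - 0.1
print("fraction of inputs affected:", changed.mean())
print("affected inputs all lie where piece 2 was nearly dominant:",
      bool(np.all(near_dominant[changed])))
```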
I suspect the notions one should be looking at are the activation polytope and activation fan in Section 5 of the paper. The hypothesis would be something like: efficiently learnable features have a ‘locality’ constraint on their activation polytopes, i.e. they are ‘small’, ‘active on only a few datapoints’.
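One could probe this empirically along the following lines (a rough sketch; here “activation polytope” is read as the set of inputs sharing a ReLU on/off pattern, as in Section 5): count how many datapoints land in each cell, and call a feature ‘local’ when its polytope contains few of them.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# One ReLU layer: each input x gets a binary on/off pattern 1[Wx + b > 0];
# inputs sharing a pattern lie in the same activation polytope (a cell of
# the activation fan).
W, b = rng.normal(size=(10, 2)), rng.normal(size=10)
X = rng.normal(size=(1000, 2))  # stand-in dataset

patterns = (X @ W.T + b) > 0                    # (1000, 10) boolean
cells = Counter(p.tobytes() for p in patterns)  # datapoints per polytope

print("distinct activation polytopes hit by the data:", len(cells))
print("datapoints in the most populated cell:", max(cells.values()))
# A 'local' feature in the sense above would be one whose polytope
# contains only a few of these datapoints.
```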