One argument sketch using SLT that NNs are biased towards low-complexity solutions: suppose reality is generated by a width-3 network, and you’re modelling it with a width-4 network. Then, along with the generic symmetries, optimal solutions also have continuous symmetries where you can switch which neuron is turned off.
Roughly, say neurons 3 and 4 have the same input weight vectors (so their activations are the same), but neuron 4’s output weight vector is all zeros. Then you can continuously scale up the output vector of neuron 4 while simultaneously scaling down the output vector of neuron 3, and the network keeps computing the same function. Also, when neuron 4’s input and output weights are all zero, you can arbitrarily change either its inputs or its outputs (but not both) without changing the computed function.
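To make the continuous symmetry concrete, here is a minimal numerical sketch (my own illustration; the one-hidden-layer ReLU architecture, the shapes, and the absence of biases are arbitrary choices, not part of the argument above):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, width, n_out = 3, 4, 2
W_in = rng.normal(size=(width, n_in))    # input weights: 3 inputs -> 4 hidden neurons
W_out = rng.normal(size=(n_out, width))  # output weights: 4 hidden neurons -> 2 outputs

W_in[3] = W_in[2]   # neurons 3 and 4 (indices 2 and 3) share input weights ...
W_out[:, 3] = 0.0   # ... and neuron 4's output weights are all zeros

def forward(x, W_in, W_out):
    return W_out @ np.maximum(0.0, W_in @ x)  # one hidden ReLU layer, no biases

x = rng.normal(size=n_in)
y0 = forward(x, W_in, W_out)

# Shift output weight continuously from neuron 3 to neuron 4: since their hidden
# activations are identical, the computed function never changes, so the optimum
# is a whole continuous family of parameters rather than an isolated point.
for t in np.linspace(0.0, 1.0, 5):
    W_shifted = W_out.copy()
    W_shifted[:, 3] = t * W_out[:, 2]
    W_shifted[:, 2] = (1.0 - t) * W_out[:, 2]
    assert np.allclose(forward(x, W_in, W_shifted), y0)
```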
Anyway, this means that when the data is generated by a slim neural net, optimal nets will have a good (low) RLCT, but when it’s generated by a neural net of the same width as the model, optimal nets will have a bad (high) RLCT. So nets can learn simple data, and it’s easier for them to learn simple data than complex data, at least if thin neural nets count as simple.
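For context on why a lower RLCT is “good”: roughly, and glossing over regularity conditions, Watanabe’s free energy asymptotics give

$$F_n \;=\; n L_n(w^*) \;+\; \lambda \log n \;+\; O_p(\log\log n),$$

where $F_n$ is the Bayesian free energy (negative log marginal likelihood) on $n$ samples, $L_n(w^*)$ is the loss at the optimal parameters, and $\lambda$ is the RLCT. A smaller $\lambda$ means a smaller complexity penalty, so the posterior concentrates on the more degenerate, lower-RLCT solutions.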
This is basically a justification of something like your point 1, but AFAICT it’s closer to a proof in the SLT setting than in your setting.
Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?
This is basically a justification of something like your point 1, but AFAICT it’s closer to a proof in the SLT setting than in your setting.
I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like “the ground truth function is generated by a small neural net” and “learning is done in a Bayesian way”, etc.
Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?
No? It amounts to assuming that smaller neural networks are a better match for the actual data generating process of the world.
The assumption that small neural networks are a good match for the actual data generating process of the world is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, provided we also accept the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).
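Spelling out the direction of this equivalence that the argument actually uses, in my own notation (writing $P_{\mathrm{NN}}$ for the network’s prior over functions and $f^{*}$ for the true data generating function):

$$\big(P_{\mathrm{NN}}(f)\ \text{large whenever } f \text{ is computed by a small net}\big)\ \wedge\ \big(f^{*}\ \text{is computed by a small net}\big)\ \Longrightarrow\ P_{\mathrm{NN}}(f^{*})\ \text{large}.$$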