Neural Networks Have a Bias Towards Highly Decomposable Functions
tl;dr Neural networks favor functions that can be “decomposed” into a composition of simple pieces in many ways—“highly decomposable functions”.
Degeneracy = bias under uniform prior
[see here for why I think bias under the uniform prior is important]
Consider a space W of parameters used to implement functions, where each element w∈W specifies a function fw:X→Y via some map π. Here, the set W is our parameter space, and we can think of each w as representing a specific configuration of the neural network that yields a particular function fw.
The mapping π assigns each point w∈W to a function fw. Due to redundancies and symmetries in parameter space, multiple configurations w might yield the same function. These form what we call the fiber of f, or its “set of degenerates”:
π−1(f) = {w∈W | π(w) = fw = f}
This fiber is the set of ways in which the same functional behavior can be achieved by different parameterizations. If we sample parameters uniformly from W, the degeneracy of a function f (the size of its fiber) determines how likely f is to be sampled.
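To make the fiber picture concrete, here is a minimal Python sketch; the parameter space, input domain, and map π below are made-up toy choices for illustration, not anything specific from the post. It enumerates a small discrete parameter space, groups parameters by the function they implement, and shows that under a uniform prior a function's sampling probability is proportional to the size of its fiber π−1(f).

```python
from collections import Counter
from fractions import Fraction

# Toy parameter space W and parameter-to-function map pi (hypothetical example).
# Each "function" is a ReLU-like map x -> max(0, w * x), represented by its
# value table on a small finite input domain X.
W = range(-3, 4)          # parameter space: integers -3..3
X = (0, 1, 2)             # finite input domain

def pi(w):
    """Map a parameter w to the function it implements (as a value table)."""
    return tuple(max(0, w * x) for x in X)

# Fiber of f: the set of parameters implementing the same function.
fibers = Counter(pi(w) for w in W)

# Under a uniform prior over W, the probability of sampling a function f
# is proportional to its degeneracy |pi^{-1}(f)|.
total = len(W)
for f, degeneracy in sorted(fibers.items(), key=lambda kv: -kv[1]):
    print(f"f = {f}, |fiber| = {degeneracy}, P(f) = {Fraction(degeneracy, total)}")
```

In this toy case the zero function is implemented by every non-positive parameter, so it has the largest fiber and is the most likely function to be sampled.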
The Bias Toward Decomposability
Consider a neural network architecture built out of l layers. Mathematically, we can decompose the parameter space W as a product:
W=W1×W2×...×Wl,
where each Wi represents parameters for a particular layer. The function implemented by the network, fw, is then a composition:
fw=fw1∘fw2∘...∘fwl
For a function f, its degeneracy (the number of ways to parameterize it) is
|π−1(f)| = ∑(f1,...,fl)∈V(f) |π−1(f1)|⋅|π−1(f2)|⋅...⋅|π−1(fl)|.
Here, V(f) is the set of all possible decompositions f=f1∘f2∘...∘fl of f into layer functions, with |π−1(fi)| counting the parameterizations of fi within layer i's parameter space Wi.
This means that functions with many such decompositions are more likely to be sampled.
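To make this concrete, here is a minimal Python sketch of a two-layer (l=2) toy model; the finite domain, the clamped-shift layer functions, and the parameter ranges are made-up choices for illustration. It counts each composed function's degeneracy directly over W1×W2, recomputes it via the decomposition formula above, checks that the two agree, and prints the most decomposable functions, which receive the most probability mass under a uniform prior.

```python
from collections import Counter
from itertools import product

# Toy two-layer "network" on a tiny finite domain (all choices are hypothetical).
X = (0, 1, 2, 3)                      # finite input/output domain
W1 = W2 = range(-4, 5)                # per-layer parameter spaces W1, W2

def layer(w):
    """Per-layer parameter-to-function map: a clamped shift x -> clip(x + w)."""
    return tuple(min(3, max(0, x + w)) for x in X)

def compose(f1, f2):
    """Function composition f1∘f2 on value tables (f2 applied first)."""
    return tuple(f1[f2[x]] for x in X)

# Per-layer fibers: how many parameters in each layer implement each layer function.
fibers1 = Counter(layer(w1) for w1 in W1)
fibers2 = Counter(layer(w2) for w2 in W2)

# Degeneracy of each composed function, counted directly over W = W1 x W2.
direct = Counter(compose(layer(w1), layer(w2)) for w1, w2 in product(W1, W2))

# Degeneracy via the decomposition formula:
#   |pi^{-1}(f)| = sum over (f1, f2) with f1∘f2 = f of |pi1^{-1}(f1)| * |pi2^{-1}(f2)|
via_formula = Counter()
for f1, n1 in fibers1.items():
    for f2, n2 in fibers2.items():
        via_formula[compose(f1, f2)] += n1 * n2

assert direct == via_formula   # the two counts agree

# Functions with many decompositions dominate under uniform sampling of (w1, w2).
total = len(W1) * len(W2)
for f, deg in direct.most_common(3):
    print(f"f = {f}, degeneracy = {deg}, P(f) = {deg}/{total}")
```

In this toy setting the most probable composed functions are exactly those that can be written as f1∘f2 in many different ways, which is the bias the formula describes.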
In summary, the layered design of neural networks introduces an implicit bias toward highly decomposable functions.