Neural Networks Have a Bias Towards Highly Decomposable Functions
tl;dr Neural networks favor functions that can be “decomposed” into a composition of simple pieces in many ways—“highly decomposable functions”.
Degeneracy = bias under uniform prior
[see here for why I think bias under the uniform prior is important]
Consider a space W of parameters used to implement functions, where each element w∈W specifies a function fw:X→Y via some map π. Here, the set W is our parameter space, and we can think of each w as representing a specific configuration of the neural network that yields a particular function fw.
The mapping π assigns each point w∈W to a function fw. Due to redundancies and symmetries in parameter space, multiple configurations w might yield the same function. These form what we call the fiber of f, or its “set of degenerates”:
π−1(f) = {w∈W | π(w) = fw = f}
This fiber is the set of ways in which the same functional behavior can be achieved by different parameterizations. If we sample parameters uniformly from W, the degeneracy of a function f (the size of its fiber) determines how likely f is to be sampled.
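To make the fiber picture concrete, here is a minimal Python sketch; the parameter space, input domain, and map π below are made-up toy choices for illustration, not anything specific from the post. It enumerates a small discrete parameter space, groups parameters by the function they implement, and shows that under a uniform prior a function's sampling probability is proportional to the size of its fiber π−1(f).

```python
from collections import Counter
from fractions import Fraction

# Toy parameter space W and parameter-to-function map pi (hypothetical example).
# Each "function" is a ReLU-like map x -> max(0, w * x), represented by its
# value table on a small finite input domain X.
W = range(-3, 4)          # parameter space: integers -3..3
X = (0, 1, 2)             # finite input domain

def pi(w):
    """Map a parameter w to the function it implements (as a value table)."""
    return tuple(max(0, w * x) for x in X)

# Fiber of f: the set of parameters implementing the same function.
fibers = Counter(pi(w) for w in W)

# Under a uniform prior over W, the probability of sampling a function f
# is proportional to its degeneracy |pi^{-1}(f)|.
total = len(W)
for f, degeneracy in sorted(fibers.items(), key=lambda kv: -kv[1]):
    print(f"f = {f}, |fiber| = {degeneracy}, P(f) = {Fraction(degeneracy, total)}")
```

In this toy case the zero function is implemented by every non-positive parameter, so it has the largest fiber and is the most likely function to be sampled.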
The Bias Toward Decomposability
Consider a neural network architecture built out of l layers. Mathematically, we can decompose the parameter space W as a product:
W=W1×W2×...×Wl,
where each Wi represents parameters for a particular layer. The function implemented by the network, fw, is then a composition:
fw=fw1∘fw2∘...∘fwl
For a function f, its degeneracy (the number of ways to parameterize it) is
|π−1(f)| = ∑(f1,...,fl)∈V(f) |π−1(f1)|⋅|π−1(f2)|⋅...⋅|π−1(fl)|.
Here, V(f) is the set of all possible decompositions f=f1∘f2∘...∘fl of f into layer functions, with |π−1(fi)| counting the parameterizations of fi within layer i's parameter space Wi.
This means that functions with many such decompositions are more likely to be sampled.
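To make this concrete, here is a minimal Python sketch of a two-layer (l=2) toy model; the finite domain, the clamped-shift layer functions, and the parameter ranges are made-up choices for illustration. It counts each composed function's degeneracy directly over W1×W2, recomputes it via the decomposition formula above, checks that the two agree, and prints the most decomposable functions, which receive the most probability mass under a uniform prior.

```python
from collections import Counter
from itertools import product

# Toy two-layer "network" on a tiny finite domain (all choices are hypothetical).
X = (0, 1, 2, 3)                      # finite input/output domain
W1 = W2 = range(-4, 5)                # per-layer parameter spaces W1, W2

def layer(w):
    """Per-layer parameter-to-function map: a clamped shift x -> clip(x + w)."""
    return tuple(min(3, max(0, x + w)) for x in X)

def compose(f1, f2):
    """Function composition f1∘f2 on value tables (f2 applied first)."""
    return tuple(f1[f2[x]] for x in X)

# Per-layer fibers: how many parameters in each layer implement each layer function.
fibers1 = Counter(layer(w1) for w1 in W1)
fibers2 = Counter(layer(w2) for w2 in W2)

# Degeneracy of each composed function, counted directly over W = W1 x W2.
direct = Counter(compose(layer(w1), layer(w2)) for w1, w2 in product(W1, W2))

# Degeneracy via the decomposition formula:
#   |pi^{-1}(f)| = sum over (f1, f2) with f1∘f2 = f of |pi1^{-1}(f1)| * |pi2^{-1}(f2)|
via_formula = Counter()
for f1, n1 in fibers1.items():
    for f2, n2 in fibers2.items():
        via_formula[compose(f1, f2)] += n1 * n2

assert direct == via_formula   # the two counts agree

# Functions with many decompositions dominate under uniform sampling of (w1, w2).
total = len(W1) * len(W2)
for f, deg in direct.most_common(3):
    print(f"f = {f}, degeneracy = {deg}, P(f) = {deg}/{total}")
```

In this toy setting the most probable composed functions are exactly those that can be written as f1∘f2 in many different ways, which is the bias the formula describes.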
In summary, the layered design of neural networks introduces an implicit bias toward highly decomposable functions.