[Short version] Information Loss --> Basin flatness
This is an overview for advanced readers. Main post: Information Loss --> Basin flatness
Summary:
Inductive bias is related to, among other things:
Basin flatness
Which solution manifolds (manifolds of zero loss) are higher dimensional than others. This is closely related to “basin flatness”, since each dimension of the manifold is a direction of zero curvature.
In relation to basin flatness and manifold dimension:
It is useful to consider the “behavioral gradients” $\nabla_\theta f(\theta, x_i)$ for each input $x_i$.
Let $G$ be the matrix of behavioral gradients (the $i$th column of $G$ is $g_i = \nabla_\theta f(\theta, x_i)$).[1] We can show that $\dim(\text{manifold}) \le N - \operatorname{rank}(G)$.[2]
$\operatorname{rank}(\text{Hessian}) = \operatorname{rank}(G)$.[3][4]
Flat basin ≈ low-rank Hessian = low-rank $G$ ≈ high manifold dimension
High manifold dimension ≈ low-rank $G$ = linear dependence of behavioral gradients (see the first sketch after this summary)
A case study in a very small neural network shows that “information loss” is a good qualitative interpretation of this linear dependence.
Models that throw away enough information about the input in early layers are guaranteed to live on particularly high-dimensional manifolds (see the second sketch after this summary). Precise bounds seem easily derivable and might be given in a future post.
See the main post for details.
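As a concrete companion to the summary above, here is a minimal numerical sketch (my own construction, not from the post): it builds the behavioral-gradient matrix $G$ for a tiny MLP, picks targets so that the chosen parameter vector is an exact zero-loss point of an MSE loss, and checks that $\operatorname{rank}(\text{Hessian}) = \operatorname{rank}(G)$ and that the bound $N - \operatorname{rank}(G)$ comes out as expected. The architecture, sizes, and use of JAX are illustrative assumptions.

```python
# A minimal numerical sketch (my construction, not from the post): a tiny MLP,
# its behavioral-gradient matrix G, and a check that rank(Hessian) = rank(G)
# at a point of exactly zero MSE loss. Sizes and architecture are arbitrary.
import jax
import jax.numpy as jnp

D_IN, D_HID = 2, 3                        # illustrative sizes
N = D_IN * D_HID + D_HID + D_HID + 1      # total number of parameters (13)

def f(theta, x):
    """Scalar-output MLP; theta is a flat parameter vector of length N."""
    W1 = theta[:D_IN * D_HID].reshape(D_HID, D_IN)
    b1 = theta[D_IN * D_HID : D_IN * D_HID + D_HID]
    w2 = theta[D_IN * D_HID + D_HID : D_IN * D_HID + 2 * D_HID]
    b2 = theta[-1]
    return w2 @ jnp.tanh(W1 @ x + b1) + b2

theta0 = jax.random.normal(jax.random.PRNGKey(0), (N,))
xs = jax.random.normal(jax.random.PRNGKey(1), (5, D_IN))    # 5 training inputs

# Behavioral gradients g_i = grad_theta f(theta0, x_i), stacked as columns of G.
G = jax.vmap(jax.grad(f), in_axes=(None, 0))(theta0, xs).T  # shape (N, 5)

# Choose targets so that theta0 fits the data exactly (a zero-loss point).
ys = jax.vmap(f, in_axes=(None, 0))(theta0, xs)
loss = lambda theta: jnp.sum((jax.vmap(f, in_axes=(None, 0))(theta, xs) - ys) ** 2)
H = jax.hessian(loss)(theta0)

rank_G = jnp.linalg.matrix_rank(G)
print("rank(G) =", rank_G)
print("rank(Hessian) =", jnp.linalg.matrix_rank(H))
print("manifold-dimension bound N - rank(G) =", N - rank_G)
print("Hessian == 2 G G^T (up to float error):", jnp.allclose(H, 2 * G @ G.T, atol=1e-3))
```

With these sizes one generically expects $\operatorname{rank}(G) = 5$, so both ranks should print as 5 and the bound as $13 - 5 = 8$.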
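The second sketch (again my own toy, not the post's case study) illustrates the information-loss point: if the first layer's ReLUs are dead on every input, the model's output carries no information about $x$, every behavioral gradient collapses onto the output-bias direction, and $\operatorname{rank}(G)$ drops to 1, making the bound $N - \operatorname{rank}(G)$ nearly as large as it can be.

```python
# A second sketch (my own toy example): when the first layer's ReLUs are dead on
# every input, the model discards all information about x, and every behavioral
# gradient collapses onto the output-bias direction, so rank(G) = 1.
import jax
import jax.numpy as jnp

D_IN, D_HID = 2, 3
N = D_IN * D_HID + D_HID + D_HID + 1      # 13 parameters

def f(theta, x):
    W1 = theta[:D_IN * D_HID].reshape(D_HID, D_IN)
    b1 = theta[D_IN * D_HID : D_IN * D_HID + D_HID]
    w2 = theta[D_IN * D_HID + D_HID : D_IN * D_HID + 2 * D_HID]
    b2 = theta[-1]
    return w2 @ jax.nn.relu(W1 @ x + b1) + b2

# Large negative hidden biases make all pre-activations negative for the inputs
# below, so the hidden layer outputs zero regardless of x ("information loss").
theta_dead = jnp.concatenate([
    0.1 * jnp.ones(D_IN * D_HID),   # small first-layer weights
    -10.0 * jnp.ones(D_HID),        # biases that kill every ReLU
    jnp.ones(D_HID),                # second-layer weights (irrelevant: h is always 0)
    jnp.zeros(1),                   # output bias
])
xs = jax.random.normal(jax.random.PRNGKey(0), (20, D_IN))   # 20 distinct inputs

G = jax.vmap(jax.grad(f), in_axes=(None, 0))(theta_dead, xs).T
print("rank(G) =", jnp.linalg.matrix_rank(G))                                    # 1: only df/db2 is nonzero
print("manifold-dimension bound N - rank(G) =", N - jnp.linalg.matrix_rank(G))   # 12
```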
[1] In standard terminology, $G$ is the Jacobian of the concatenation of all outputs with respect to the parameters.
[2] $N$ is the number of parameters in the model. See claims 1 and 2 here for a proof sketch.
[3] Proof sketch for $\operatorname{rank}(\text{Hessian}) = \operatorname{rank}(G)$:
At a local minimum, first-order sensitivity of behavior translates to second-order sensitivity of loss.
So $\operatorname{span}(g_1, \ldots, g_k)^\perp$ is the null space of the Hessian.
So $\operatorname{rank}(\text{Hessian}) = N - (N - \operatorname{rank}(G)) = \operatorname{rank}(G)$.
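To make the first step concrete, here is one way to spell it out (my elaboration, assuming the post's setting of MSE loss summed over $k$ training points and a parameter $\theta_0$ with exactly zero loss; $\delta$ denotes a small parameter perturbation):

```latex
% Taylor-expand the loss around a zero-loss point \theta_0, using f(\theta_0, x_i) = y_i
% and g_i = \nabla_\theta f(\theta_0, x_i):
\begin{align*}
L(\theta_0 + \delta)
  &= \sum_{i=1}^{k} \bigl( f(\theta_0 + \delta, x_i) - y_i \bigr)^2
   = \sum_{i=1}^{k} \bigl( g_i^\top \delta + O(\|\delta\|^2) \bigr)^2 \\
  &= \delta^\top \Bigl( \sum_{i=1}^{k} g_i g_i^\top \Bigr) \delta + O(\|\delta\|^3)
   = \delta^\top G G^\top \delta + O(\|\delta\|^3).
\end{align*}
% The second-order change in loss vanishes exactly when \delta is orthogonal to every g_i,
% so null(Hessian) = span(g_1, ..., g_k)^\perp and hence rank(Hessian) = rank(G).
```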
[4] There is an alternate proof going through the result $\text{Hessian} = 2GG^\top$. (The factor of 2 is specific to MSE loss.)
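For completeness, the direct computation behind this result (again my own write-up, assuming MSE loss summed over the training set and writing $f_i := f(\theta, x_i)$):

```latex
% Differentiate L(\theta) = \sum_i (f_i - y_i)^2 twice with the product rule:
\begin{align*}
\nabla^2_\theta L
  &= \sum_i \Bigl[ 2\, \nabla_\theta f_i \, (\nabla_\theta f_i)^\top
                   + 2\, (f_i - y_i)\, \nabla^2_\theta f_i \Bigr].
\end{align*}
% At a zero-loss point every residual f_i - y_i vanishes, so only the first term survives:
\begin{align*}
\nabla^2_\theta L \;=\; 2 \sum_i g_i g_i^\top \;=\; 2\, G G^\top,
\qquad \operatorname{rank}\bigl(\nabla^2_\theta L\bigr) = \operatorname{rank}(G).
\end{align*}
```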