[Short version] Information Loss --> Basin flatness

This is an overview for advanced readers. Main post: Information Loss --> Basin flatness

Summary:

Inductive bias is related to, among other things:

  • Basin flatness

  • Which solution manifolds (manifolds of zero loss) are higher dimensional than others. This is closely related to “basin flatness”, since each dimension of the manifold is a direction of zero curvature.

In relation to basin flatness and manifold dimension:

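  1. It is useful to consider the “behavioral gradients” $\nabla_\theta f(\theta, x_i)$ for each input $x_i$.

  2. Let $G$ be the matrix of behavioral gradients. (The $i^{\text{th}}$ column of $G$ is $g_i = \nabla_\theta f(\theta, x_i)$.)[1] We can show that $\dim(\text{manifold}) \le N - \operatorname{rank}(G)$.[2]

  3. $\operatorname{rank}(\text{Hessian}) = \operatorname{rank}(G)$.[3][4]

  4. Flat basin $\approx$ Low-rank Hessian $=$ Low-rank $G$ $\approx$ High manifold dimension

  5. High manifold dimension $\approx$ Low-rank $G$ $=$ Linear dependence of behavioral gradients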

  6. A case study in a very small neural network shows that “information loss” is a good qualitative interpretation of this linear dependence.

  7. Models that throw away enough information about the input in early layers are guaranteed to live on particularly high-dimensional manifolds. Precise bounds seem easily derivable and might be given in a future post.
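The rank relations in the list above can be checked on a toy example. The following is a minimal sketch, not the case study from the main post: the model `f(theta, x) = c * (a*x + b)`, the parameter values, and the helper names are all assumptions chosen for illustration, and $G$ denotes the matrix whose columns are the behavioral gradients. Exact rational arithmetic is used so that ranks are not affected by floating-point tolerances.

```python
from fractions import Fraction as F

# Hypothetical toy model (not from the post): f(theta, x) = c * (a*x + b),
# with theta = (a, b, c), so N = 3 parameters and a scalar output.

def f(theta, x):
    a, b, c = theta
    return c * (a * x + b)

def behavioral_gradient(theta, x):
    """Gradient of the output f(theta, x) with respect to theta."""
    a, b, c = theta
    return [c * x, c, a * x + b]

def output_hessian(theta, x):
    """Second derivatives of f w.r.t. theta; only the a-c and b-c terms are nonzero."""
    return [[0, 0, x], [0, 0, 1], [x, 1, 0]]

def loss_hessian(theta, xs, ys):
    """Exact Hessian of the summed squared error sum_i (f(theta, x_i) - y_i)^2."""
    n = len(theta)
    H = [[F(0)] * n for _ in range(n)]
    for x, y in zip(xs, ys):
        g = behavioral_gradient(theta, x)
        r = f(theta, x) - y              # residual on this input
        d2f = output_hessian(theta, x)
        for p in range(n):
            for q in range(n):
                H[p][q] += 2 * g[p] * g[q] + 2 * r * d2f[p][q]
    return H

def rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[F(v) for v in row] for row in matrix]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [vi - factor * vk for vi, vk in zip(rows[i], rows[rk])]
        rk += 1
    return rk

theta = (F(2), F(-1), F(3))
xs = [F(-1), F(0), F(1), F(2)]
ys = [f(theta, x) for x in xs]           # labels come from theta, so the loss at theta is exactly zero

grads = [behavioral_gradient(theta, x) for x in xs]
G = [list(row) for row in zip(*grads)]   # N x k matrix: column i is the gradient for input x_i
H = loss_hessian(theta, xs, ys)

# 2 * G * G^T, computed separately for comparison with H
two_GGt = [[2 * sum(G[p][i] * G[q][i] for i in range(len(xs)))
            for q in range(len(theta))] for p in range(len(theta))]

N = len(theta)
rank_G = rank(G)
rank_H = rank(H)
manifold_dim_bound = N - rank_G
print(rank_G, rank_H, manifold_dim_bound)   # prints: 2 2 1
print(H == two_GGt)                          # prints: True
```

In this toy model the third component of every behavioral gradient is a fixed linear combination of the first two, so $G$ has rank 2 even with four inputs. The bound $N - \operatorname{rank}(G) = 1$ is tight here: rescaling $c$ by $s$ and $(a, b)$ by $1/s$ leaves every output unchanged, which traces out a one-dimensional zero-loss manifold.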

See the main post for details.

  1. ^

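    In standard terminology, the matrix of behavioral gradients $G$ is the Jacobian of the concatenation of all outputs w.r.t. the parameters (up to transposition, since the gradients are the columns of $G$).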

  2. ^

    $N$ is the number of parameters in the model. See claims 1 and 2 here for a proof sketch.

  3. ^

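    Proof sketch for $\operatorname{rank}(\text{Hessian}) = \operatorname{rank}(G)$:

    • $\operatorname{span}(g_1, \dots, g_k)^\perp$, the orthogonal complement of the span of the behavioral gradients (the columns of $G$), is the set of directions in which the output is not first-order sensitive to parameter change. Its dimensionality is $N - \operatorname{rank}(G)$.

    • At a local minimum, first-order sensitivity of behavior translates to second-order sensitivity of loss.

    • So $\operatorname{span}(g_1, \dots, g_k)^\perp$ is the null space of the Hessian.

    • So $\operatorname{rank}(\text{Hessian}) = N - (N - \operatorname{rank}(G)) = \operatorname{rank}(G)$.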

  4. ^

    There is an alternate proof going through the result $\text{Hessian} = 2GG^T$. (The constant 2 depends on MSE loss.)
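The alternate route in the last footnote can be written out as a short derivation. This is a sketch under stated assumptions rather than the main post's own proof: outputs are scalar, the loss is the sum of squared errors over $k$ inputs, $G$ is the matrix whose columns are the behavioral gradients $g_i$, and the residuals $r_i$ and labels $y_i$ are notation introduced here.

```latex
\begin{aligned}
L(\theta) &= \sum_{i=1}^{k} r_i(\theta)^2, \qquad r_i(\theta) = f(\theta, x_i) - y_i \\
\nabla_\theta L &= 2 \sum_{i=1}^{k} r_i \, g_i, \qquad g_i = \nabla_\theta f(\theta, x_i) \\
\nabla_\theta^2 L &= 2 \sum_{i=1}^{k} g_i g_i^T + 2 \sum_{i=1}^{k} r_i \, \nabla_\theta^2 f(\theta, x_i) \\
&= 2 G G^T \quad \text{when every } r_i = 0 \\
\operatorname{rank}(\nabla_\theta^2 L) &= \operatorname{rank}(G G^T) = \operatorname{rank}(G)
\end{aligned}
```

The second sum vanishes only at zero loss, which is why the identity is a statement about points on the solution manifold. If the squared errors are averaged over the $k$ inputs instead of summed, the constant becomes $2/k$; the rank conclusion is unchanged.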