I’m sorry, but the fact that the output is a scalar isn’t explained, and a network with a single neuron in the final layer is not the norm.
Fair enough, should probably add a footnote.
More importantly, I am trying to explain that I think the math does not stay the same when the network output is a vector (the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that: you can’t pull the Hessian of the loss out as a scalar. It instead acts the way I have written it, as a bilinear form between the rows and columns of the Jacobian J_f.
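To spell out what I mean (writing f for the network with parameters Θ and a K-dimensional output, and sketching the notation since I may not match yours exactly):

$$\nabla^2_\Theta \, L(f(\Theta)) \;=\; J_f^\top \left(\nabla^2_f L\right) J_f \;+\; \sum_{k=1}^{K} \frac{\partial L}{\partial f_k}\, \nabla^2_\Theta f_k$$

The K×K loss Hessian sits sandwiched between J_f^T and J_f; it only collapses to a scalar factor when K = 1.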
Do any practically used loss functions actually have cross terms that lead to off-diagonals like that? Because so long as the matrix stays diagonal, you’re effectively just adding extra norm to features in one part of the output over the others.
Which makes sense: if your loss function is paying more attention to one part of the output than to others, then perturbations to the weights of the features feeding that part are going to have an outsized effect.
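To make that concrete with two examples of my own (not from the post): a per-component weighted square loss has a diagonal Hessian in the outputs, whereas softmax cross-entropy on the logits does not. For the weighted square loss,

$$L = \sum_k w_k (f_k - y_k)^2 \;\Rightarrow\; \nabla^2_f L = 2\,\mathrm{diag}(w)$$

while for cross-entropy with one-hot targets y,

$$L = -\sum_k y_k \log p_k,\quad p = \mathrm{softmax}(f) \;\Rightarrow\; \nabla^2_f L = \mathrm{diag}(p) - p\,p^\top$$

So the weighted square loss really is just putting extra norm on some output components, while the cross-entropy Hessian has genuine off-diagonal cross terms.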
But what it means is that in the next line, when you write down the derivative with respect to Θ, you get an unusually clean expression, because it no longer depends on Θ.
The perturbative series evaluates the network at particular values of Θ. If your network has many layers that slowly build up an approximation of the function cos(x) for use in the final layer, that approximation will effectively enter the behavioural gradient as cos(x), even though its construction involves many parameters in earlier layers.
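Here is a minimal numerical sketch of that point (a toy setup of my own construction; `hidden_stack`, `network`, and `theta_out` are hypothetical names, with a truncated Taylor series standing in for the learned layers):

```python
import jax
import jax.numpy as jnp

def hidden_stack(x):
    # Stand-in for "many layers that slowly build up cos(x)":
    # a truncated Taylor series playing the role of the learned feature.
    return 1 - x**2 / 2 + x**4 / 24 - x**6 / 720

def network(theta_out, x):
    # Final layer: a single weight multiplying the built-up feature.
    return theta_out * hidden_stack(x)

x = 0.7
# Gradient of the output with respect to the final-layer weight,
# evaluated at theta_out = 1.0: it is just the feature itself.
behavioural_grad = jax.grad(network, argnums=0)(1.0, x)
print(behavioural_grad)  # ≈ 0.76484
print(jnp.cos(x))        # ≈ 0.76484
```

The gradient with respect to the final-layer weight is exactly the built-up feature, so it enters the behavioural gradient as (approximately) cos(x), regardless of how many earlier-layer parameters went into constructing it.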
You’re right about the loss thing; it isn’t as important as I first thought it might be.