I’m sorry, but the fact that the output is a scalar isn’t explained, and a network with a single neuron in the final layer is not the norm.
Fair enough, should probably add a footnote.
More importantly, I am trying to explain that I think the math does not stay the same when the network output is a vector (the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that: you can’t pull the Hessian of the loss out as a scalar. It instead acts the way I have written it, as a bilinear form between the rows and columns of the Jacobian J_f.
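To spell out what I mean (writing f for the network with parameters Θ and a K-dimensional output, and sketching the notation since I may not match yours exactly):

$$\nabla^2_\Theta \, L(f(\Theta)) \;=\; J_f^\top \left(\nabla^2_f L\right) J_f \;+\; \sum_{k=1}^{K} \frac{\partial L}{\partial f_k}\, \nabla^2_\Theta f_k$$

The K×K loss Hessian sits sandwiched between J_f^T and J_f; it only collapses to a scalar factor when K = 1.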
Do any practically used loss functions actually have cross terms that lead to off-diagonals like that? Because so long as the matrix stays diagonal, you’re effectively just adding extra norm to features in one part of the output over the others.
Which makes sense: if your loss function is paying more attention to one part of the output than to others, then perturbations to the weights of the features feeding that part are going to have an outsized effect.
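To make that concrete with two examples of my own (not from the post): a per-component weighted square loss has a diagonal Hessian in the outputs, whereas softmax cross-entropy on the logits does not. For the weighted square loss,

$$L = \sum_k w_k (f_k - y_k)^2 \;\Rightarrow\; \nabla^2_f L = 2\,\mathrm{diag}(w)$$

while for cross-entropy with one-hot targets y,

$$L = -\sum_k y_k \log p_k,\quad p = \mathrm{softmax}(f) \;\Rightarrow\; \nabla^2_f L = \mathrm{diag}(p) - p\,p^\top$$

So the weighted square loss really is just putting extra norm on some output components, while the cross-entropy Hessian has genuine off-diagonal cross terms.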
But what it means is that in the next line, when you write down the derivative with respect to Θ, you get an unusually clean expression, because it no longer depends on Θ.
The perturbative series evaluates the network at particular values of Θ. If your network has many layers that slowly build up an approximation of the function cos(x) for use in the final layer, that approximation will effectively enter the behavioural gradient as cos(x), even though its construction involves many parameters in earlier layers.
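Here is a minimal numerical sketch of that point (a toy setup of my own construction; `hidden_stack`, `network`, and `theta_out` are hypothetical names, with a truncated Taylor series standing in for the learned layers):

```python
import jax
import jax.numpy as jnp

def hidden_stack(x):
    # Stand-in for "many layers that slowly build up cos(x)":
    # a truncated Taylor series playing the role of the learned feature.
    return 1 - x**2 / 2 + x**4 / 24 - x**6 / 720

def network(theta_out, x):
    # Final layer: a single weight multiplying the built-up feature.
    return theta_out * hidden_stack(x)

x = 0.7
# Gradient of the output with respect to the final-layer weight,
# evaluated at theta_out = 1.0: it is just the feature itself.
behavioural_grad = jax.grad(network, argnums=0)(1.0, x)
print(behavioural_grad)  # ≈ 0.76484
print(jnp.cos(x))        # ≈ 0.76484
```

The gradient with respect to the final-layer weight is exactly the built-up feature, so it enters the behavioural gradient as (approximately) cos(x), regardless of how many earlier-layer parameters went into constructing it.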
You’re right about the loss thing; it isn’t as important as I first thought it might be.