It’s an example computation for a network with a scalar output, yes. The math should stay the same for multi-dimensional outputs, though. You should just get higher-dimensional tensors instead of matrices.
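For concreteness, a quick JAX shape check (a toy model of my own, not the example from the post): with a vector output, the per-output parameter Hessians stack into a rank-3 tensor rather than a single matrix.

```python
# Quick shape check (toy model of my own): with a vector output, the
# per-output parameter Hessians stack into a rank-3 tensor rather than a
# single matrix.
import jax
import jax.numpy as jnp

def f(theta, x):
    # toy network with a 2-dimensional output and 3 parameters
    return jnp.array([theta[0] * x + theta[1], jnp.sin(theta[2] * x)])

theta0 = jnp.array([0.5, -1.0, 2.0])
H = jax.hessian(lambda th: f(th, 0.7))(theta0)
print(H.shape)   # (2, 3, 3): one 3x3 parameter Hessian per output component
```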
I’m sorry, but the fact that the output is scalar isn’t explained, and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same when the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that; you can’t pull out the Hessian of the loss as a scalar. Instead, it acts in the way that I have written: as a bilinear form between the rows and columns of J_f.
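For concreteness, here is a small numerical sketch of what I mean (toy functions of my own, not the post’s example, written in JAX): for vector output f, the parameter Hessian of L(f(Θ)) is J_f^T (∇²_f L) J_f plus a sum of each output’s parameter Hessian weighted by ∂L/∂f_k, so the output-space Hessian of the loss sits between J_f^T and J_f as a bilinear form rather than factoring out as a scalar.

```python
# Toy check (my own functions, not the post's example) that for vector output
# the parameter Hessian of the loss decomposes as
#   J_f^T (d^2 L / df^2) J_f  +  sum_k (dL/df_k) * Hess_theta(f_k),
# i.e. the output-space Hessian of the loss acts as a bilinear form between
# the rows and columns of J_f instead of factoring out as a scalar.
import jax
import jax.numpy as jnp

def f(theta, x):
    # toy "network" with a 2-dimensional output, nonlinear in theta
    return jnp.array([theta[0] * x + jnp.sin(theta[1]),
                      theta[1] * x**2 + theta[0] * theta[2]])

def L(y):
    # some generic non-quadratic loss with cross terms between the outputs
    return jnp.exp(y[0] * y[1]) + y[0] ** 2

theta0 = jnp.array([0.3, -1.2, 0.7])
x0 = 0.5
loss = lambda th: L(f(th, x0))

H_full = jax.hessian(loss)(theta0)               # full parameter Hessian, (3, 3)

J = jax.jacobian(lambda th: f(th, x0))(theta0)   # J_f, shape (2, 3)
H_L = jax.hessian(L)(f(theta0, x0))              # output-space Hessian, (2, 2)
g_L = jax.grad(L)(f(theta0, x0))                 # dL/df, shape (2,)
H_f = jax.hessian(lambda th: f(th, x0))(theta0)  # per-output parameter Hessians, (2, 3, 3)

H_rebuilt = J.T @ H_L @ J + jnp.einsum('k,kij->ij', g_L, H_f)
print(jnp.allclose(H_full, H_rebuilt, atol=1e-4))  # True
```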
A feature to me is the same kind of thing it is to e.g. Chris Olah. It’s the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.
I’m not assuming that the function is linear in Θ. If it were, this whole thing wouldn’t just be an approximation within second-order Taylor expansion distance; it would hold everywhere.
OK, maybe I’ll try to avoid a debate about exactly what ‘feature’ means, or means to different people, but in the example you are clearly using f(x) = Θ_0 + Θ_1 x_1 + Θ_2 cos(x_1). This is a linear function of the Θ variables. (I said “Is the example not too contrived... in particular it is linear in Θ”; I’m not sure how we have misunderstood each other, perhaps you didn’t realise I meant this example as opposed to the whole post in general.) But what it means is that in the next line when you write down the derivative with respect to Θ, it is an unusually clean expression because it now doesn’t depend on Θ. So again, in the crucial equation right after you say “The Hessian matrix for this network would be...”, you in general get Θ variables appearing in the matrix. It is just not as clean as this expression suggests.
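To make the contrast concrete, here is a toy JAX comparison (my own functions, not from the post): the parameter gradient of the linear-in-Θ example is Θ-free, while for a model that is nonlinear in Θ, Θ shows up in the gradient.

```python
# Toy contrast (my own functions): the gradient of the linear-in-theta example
# is theta-free, whereas a model that is nonlinear in theta has theta
# appearing in its gradient.
import jax
import jax.numpy as jnp

def f_linear(theta, x):
    # the example from the discussion: linear in theta
    return theta[0] + theta[1] * x + theta[2] * jnp.cos(x)

def f_nonlinear(theta, x):
    # e.g. a tiny one-hidden-unit network: nonlinear in theta
    return theta[2] * jnp.tanh(theta[0] * x + theta[1])

x0 = 1.3
print(jax.grad(f_linear)(jnp.array([0.1, 0.2, 0.3]), x0),
      jax.grad(f_linear)(jnp.array([5.0, -2.0, 7.0]), x0))
# identical: [1, x0, cos(x0)], independent of theta

print(jax.grad(f_nonlinear)(jnp.array([0.1, 0.2, 0.3]), x0),
      jax.grad(f_nonlinear)(jnp.array([5.0, -2.0, 7.0]), x0))
# different: theta shows up in the gradient
```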
> I’m sorry, but the fact that the output is scalar isn’t explained, and a network with a single neuron in the final layer is not the norm.
Fair enough, should probably add a footnote.
> More importantly, I am trying to explain that I think the math does not stay the same when the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that; you can’t pull out the Hessian of the loss as a scalar. Instead, it acts in the way that I have written: as a bilinear form between the rows and columns of J_f.
Do any loss functions used in practice actually have cross terms that lead to off-diagonals like that? Because as long as the matrix stays diagonal, you’re effectively just adding extra norm to features in one part of the output over the others.
Which makes sense: if your loss function is paying more attention to one part of the output than others, then perturbations to the weights of features of that part are going to have an outsized effect.
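A tiny sketch of the “extra norm” point (toy numbers of my own): a diagonal output-space Hessian D just rescales the rows of J_f, since J_f^T D J_f equals the ordinary Gauss-Newton-style term for the reweighted Jacobian sqrt(D) J_f.

```python
# Toy numbers of my own: a diagonal output-space Hessian D just reweights the
# rows of J_f, since J_f^T D J_f equals the Gauss-Newton-style term for the
# rescaled Jacobian sqrt(D) J_f.
import jax.numpy as jnp

J = jnp.array([[1.0, 0.5, -2.0],
               [0.3, 4.0,  1.0]])       # toy Jacobian: 2 outputs, 3 parameters
D = jnp.diag(jnp.array([2.0, 0.1]))     # diagonal output-space Hessian

lhs = J.T @ D @ J
rhs = (jnp.sqrt(D) @ J).T @ (jnp.sqrt(D) @ J)
print(jnp.allclose(lhs, rhs))           # True
```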
> But what it means is that in the next line when you write down the derivative with respect to Θ, it is an unusually clean expression because it now doesn’t depend on Θ.
The perturbative series evaluates the network at particular values of Θ. If your network has many layers that slowly build up an approximation of the function cos(x) to use in the final layer, that approximation will effectively enter the behavioural gradient as cos(x), even though its construction involves many parameters in previous layers.
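A minimal JAX sketch of this (a toy model of my own, not from the post): the gradient of the output with respect to the final-layer weight is just the value of the hidden feature at the current Θ, however many earlier parameters go into building that feature.

```python
# Toy model of my own: however many earlier parameters go into building the
# hidden feature, the gradient of the output with respect to the final-layer
# weight is just that feature's value at the current parameters.
import jax
import jax.numpy as jnp

def net(params, x):
    w1, b1, w2 = params
    hidden = jnp.tanh(w1 * x + b1)   # stand-in for a feature built up by earlier layers
    return w2 * hidden

params0 = (1.7, -0.3, 0.9)
x0 = 0.8

grads = jax.grad(net)(params0, x0)
feature_value = jnp.tanh(params0[0] * x0 + params0[1])
print(grads[2], feature_value)       # equal: d(output)/d(w2) = feature(x; current params)
```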
You’re right about the loss thing; it isn’t as important as I first thought it might be.