General:
This was a very clear explanation. Simplifications were used, then discarded, at good points. Everything built up very well, and I feel I have a much clearer understanding—and more specific questions. (Like how is the number of nodes/layers chosen?)
Specifics:
f = A_ℓ ∘ ⋯ ∘ A_2
What’s the “ℓ”? (I’m unclear on how one iterates from L to 2.)
Nonetheless, I at least feel like I now have some nonzero insight into why neural networks are powerful, which is more than I had before reading the paper.
And you’ve explained the ‘ML is just matrix multiplication no one understands’ joke, which I appreciate.
As mentioned, we assume that we're in the setting of supervised learning, where we have access to a sequence S = ((x_1, y_1), ..., (x_m, y_m)) of training examples. Each x_i is an input to the network for which y_i is the corresponding correct output.
This topic deserves its own comment. (And me figuring out the formatting.)
For unimportant reasons, we square the difference
Absolute value, because bigger errors are quadratically worse, it was tried and it worked better, or tradition?
This makes it convenient to use in the backpropagation algorithm.
Almost as convenient as the identity function.
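As a concrete illustration of the quoted setup and of squaring the difference, here is a minimal Python sketch; the dataset S and the candidate function f below are made up purely for illustration.

```python
# A minimal sketch of the quoted setup: a training sequence S of
# (input, correct output) pairs and the squared-error loss of a
# candidate function f on it. Data and f are made up for illustration.
S = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.5)]  # (x_i, y_i) pairs

def f(x):
    # a hypothetical candidate function the network might currently compute
    return 1.9 * x

def squared_error_loss(f, S):
    # average of (f(x_i) - y_i)^2 over the training examples
    return sum((f(x) - y) ** 2 for x, y in S) / len(S)

print(squared_error_loss(f, S))
```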
What’s the “ℓ”? (I’m unclear on how one iterates from L to 2.)
ℓ is the number of layers. So if there are 5 layers, then f = A_5 ∘ A_4 ∘ A_3 ∘ A_2. There is one fewer transformation than the number of layers because there is exactly one transformation between each pair of adjacent layers.
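A small Python sketch of this counting, with placeholder functions standing in for the transformations A_2 through A_5 (in a real network each would be an affine map followed by a nonlinearity):

```python
# A sketch of f = A_5 ∘ A_4 ∘ A_3 ∘ A_2 for a 5-layer network:
# one transformation A_k per pair of adjacent layers, so 4 in total.
def compose(*fs):
    # compose(A5, A4, A3, A2)(x) == A5(A4(A3(A2(x))))
    def composed(x):
        for g in reversed(fs):
            x = g(x)
        return x
    return composed

A2 = lambda x: 2 * x   # layer 1 -> layer 2 (placeholder)
A3 = lambda x: x + 1   # layer 2 -> layer 3 (placeholder)
A4 = lambda x: 3 * x   # layer 3 -> layer 4 (placeholder)
A5 = lambda x: x - 5   # layer 4 -> layer 5 (placeholder)

f = compose(A5, A4, A3, A2)
print(f(1.0))  # A5(A4(A3(A2(1.0)))) = 4.0
```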
Absolute value, because bigger errors are quadratically worse, it was tried and it worked better, or tradition?
I genuinely don’t know. I’ve wondered forever why squaring is so popular. It’s not just in ML, but everywhere.
My best guess is that it's in some fundamental sense more natural. Suppose you want to guess a location on a map. In that case, the obvious error would be the straight-line distance between your guess and the target. If your guess is (x, y) and the correct location is (x∗, y∗), then the distance is √((x−x∗)² + (y−y∗)²) – that's just how distances are computed in 2-dimensional space. (Draw a triangle between both points and use the Pythagorean theorem.) Now there's a square root, but the square root doesn't matter for the purposes of minimization – the square root is minimal if and only if the thing under the root is minimal, so you might as well minimize (x−x∗)² + (y−y∗)². The same is true in 3-dimensional space or n-dimensional space. So if general distance in abstract vector spaces works like the straight-line distance does in geometric space, then squared error is the way to go.
Also, thanks :)
One reason for using squared errors, which may be good or bad depending on the context, is that it's usually easier to Do Mathematics on it.
Squared error has been used instead of absolute error in many diverse optimization problems in part because its derivative is proportional to the magnitude of the error, whereas the derivative of the absolute error is constant. When you're trying to solve a smooth optimization problem with gradient methods, you generally benefit from loss functions with a smooth gradient that tends towards zero along with the error.
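A minimal sketch of that difference, writing the error as e and comparing the derivative of e² with the derivative of |e|:

```python
# The gradient of the squared error scales with the error itself,
# while the gradient of the absolute error has constant magnitude 1.
def grad_squared(e):
    # d/de of e^2
    return 2 * e

def grad_absolute(e):
    # d/de of |e| (undefined at 0; use 0 there by convention)
    return 0.0 if e == 0 else (1.0 if e > 0 else -1.0)

for e in [4.0, 1.0, 0.1, 0.001]:
    print(e, grad_squared(e), grad_absolute(e))
# As the error shrinks, the squared-error gradient tends to zero,
# giving gradient descent smaller and smaller steps near the optimum.
```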
Another possible reason for using squared error is that, from a stats perspective, the Bayes (optimal) estimator under squared-error loss is the mean of the distribution, i.e. the constant c that minimizes E[(X−c)²], whereas the Bayes estimator under absolute error (MAE) is the median. It's not clear to me that the mean is what you want, but maybe?
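A quick numerical check of this claim on a made-up sample, using a simple grid search for the constant that minimizes each loss:

```python
# The constant that minimizes the mean squared deviation is the mean;
# the constant that minimizes the mean absolute deviation is a median.
data = [1.0, 2.0, 2.5, 4.0, 10.0]

def mean_sq_dev(c):
    return sum((x - c) ** 2 for x in data) / len(data)

def mean_abs_dev(c):
    return sum(abs(x - c) for x in data) / len(data)

candidates = [i / 100 for i in range(0, 1101)]  # grid over [0, 11]
print(min(candidates, key=mean_sq_dev))   # 3.9, the sample mean
print(min(candidates, key=mean_abs_dev))  # 2.5, the sample median
```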