What’s the “ℓ”? (I’m unclear on how one iterates from L to 2.)
ℓ is the number of layers. So if there are 5 layers, then $f = A_5 \circ A_4 \circ A_3 \circ A_2$. It's one fewer transformation than the number of layers because there is only one between each pair of layers.
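If it helps, here's a minimal sketch of that composition in code, assuming each $A_i$ is an affine map $x \mapsto W_i x + b_i$ (the dimensions, weights, and names below are made up for illustration):

```python
import numpy as np

# Illustrative sketch: a 5-layer network as the composition A_5 ∘ A_4 ∘ A_3 ∘ A_2,
# assuming each A_i is an affine map x -> W_i @ x + b_i (all values are made up).
rng = np.random.default_rng(0)
dims = [3, 4, 4, 4, 2]  # widths of layers 1..5
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(4)]  # W_2 .. W_5
bs = [rng.normal(size=dims[i + 1]) for i in range(4)]             # b_2 .. b_5

def f(x):
    # Apply A_2 first, then A_3, ..., up to A_5: four maps for five layers.
    for W, b in zip(Ws, bs):
        x = W @ x + b
    return x

print(f(np.ones(3)))  # a point in the 2-dimensional output layer
```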
Why squared error rather than absolute value: because bigger errors are quadratically worse, because it was tried and it worked better, or tradition?
I genuinely don’t know. I’ve wondered forever why squaring is so popular. It’s not just in ML, but everywhere.
My best guess is that it’s in some fundamental sense more natural. Suppose you want to guess a location on a map. In that case, the obvious error would be the straight-line distance between you and the target. If your guess is $(x, y)$ and the correct location is $(x^*, y^*)$, then the distance is $\sqrt{(x - x^*)^2 + (y - y^*)^2}$ – that’s just how distances are computed in 2-dimensional space. (Draw a triangle between both points and use the Pythagorean theorem.) Now there’s a square root, but actually the square root doesn’t matter for the purposes of minimization – the square root is minimal if and only if the thing under the root is minimal, so you might as well minimize $(x - x^*)^2 + (y - y^*)^2$. The same is true in 3-dimensional space or $n$-dimensional space. So if general distance in abstract vector spaces works like the straight-line distance does in geometric space, then squared error is the way to go.
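To spell out the $n$-dimensional version of that step (a sketch of the same argument, writing $\hat y$ for the guess and $y^*$ for the target):

\[
\arg\min_{\hat y} \sqrt{\sum_{i=1}^{n} (\hat y_i - y_i^*)^2}
= \arg\min_{\hat y} \sum_{i=1}^{n} (\hat y_i - y_i^*)^2 ,
\]

because $t \mapsto \sqrt{t}$ is strictly increasing on $[0, \infty)$, so taking the root doesn’t change where the minimum is.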
Squared error has been used instead of absolute error in many diverse optimization problems in part because its derivative is proportional to the magnitude of the error, whereas the derivative of the absolute error has constant magnitude. When you’re trying to solve a smooth optimization problem with gradient methods, you generally benefit from loss functions with a smooth gradient that tends towards zero along with the error.
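A tiny numerical sketch of that point (the function names and error values below are just for illustration):

```python
import numpy as np

# Gradients of the two losses with respect to the error e = prediction - target.
def grad_squared(e):
    return 2.0 * e    # shrinks smoothly as the error shrinks

def grad_absolute(e):
    return np.sign(e)  # stays at +/-1 no matter how small the error is

for e in [2.0, 0.5, 0.01]:
    print(e, grad_squared(e), grad_absolute(e))
# With a fixed step size, gradient steps on |e| keep the same length near the
# optimum and tend to overshoot, while steps on e^2 naturally get smaller.
```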
Another possible reason for using squared error is that, from a stats perspective, the Bayes (optimal) estimator under squared-error loss (the $c$ minimizing $E[(X - c)^2]$) is the mean of the distribution, whereas the optimal estimator under absolute-error (MAE) loss is the median. It’s not clear to me that the mean’s what you want, but maybe?
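A quick sanity check of that fact on simulated data (the sample and candidate grid below are made up for illustration):

```python
import numpy as np

# On a skewed sample, the mean minimizes mean squared error and the
# median minimizes mean absolute error.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)

candidates = np.linspace(0.0, 6.0, 601)
mse = [np.mean((x - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(x - c)) for c in candidates]

print("argmin MSE:", candidates[np.argmin(mse)], "mean:  ", x.mean())
print("argmin MAE:", candidates[np.argmin(mae)], "median:", np.median(x))
```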
Also, thanks :)
One reason for using squared error, which may be good or bad depending on the context, is that it’s usually easier to Do Mathematics on it.