General:
This was a very clear explanation. Simplifications were used, then discarded, at good points. Everything built up very well, and I feel I have a much clearer understanding—and more specific questions. (Like how is the number of nodes/layers chosen?)
Specifics:
f = A_ℓ ∘ ⋯ ∘ A_2
What’s the “ℓ”? (I’m unclear on how one iterates from L to 2.)
Nonetheless, I at least feel like I now have some nonzero insight into why neural networks are powerful, which is more than I had before reading the paper.
And you’ve explained the ‘ML is just matrix multiplication no one understands’ joke, which I appreciate.
As mentioned, we assume that we're in the setting of supervised learning, where we have access to a sequence S = ((x_1, y_1), ..., (x_m, y_m)) of training examples. Each x_i is an input to the network for which y_i is the corresponding correct output.
This topic deserves its own comment. (And me figuring out the formatting.)
For unimportant reasons, we square the difference
Absolute value, because bigger errors are quadratically worse, it was tried and it worked better, or tradition?
This makes it convenient to use in the backpropagation algorithm.
Almost as convenient as the identity function.
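As a concrete illustration of the quoted setup and of squaring the difference, here is a minimal Python sketch; the dataset S and the candidate function f below are made up purely for illustration.

```python
# A minimal sketch of the quoted setup: a training sequence S of
# (input, correct output) pairs and the squared-error loss of a
# candidate function f on it. Data and f are made up for illustration.
S = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.5)]  # (x_i, y_i) pairs

def f(x):
    # a hypothetical candidate function the network might currently compute
    return 1.9 * x

def squared_error_loss(f, S):
    # average of (f(x_i) - y_i)^2 over the training examples
    return sum((f(x) - y) ** 2 for x, y in S) / len(S)

print(squared_error_loss(f, S))
```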
What’s the “ℓ”? (I’m unclear on how one iterates from L to 2.)
ℓ is the number of layers. So if there are 5 layers, then f = A_5 ∘ A_4 ∘ A_3 ∘ A_2. There is one fewer transformation than the number of layers because there is exactly one transformation between each pair of adjacent layers.
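A small Python sketch of this counting, with placeholder functions standing in for the transformations A_2 through A_5 (in a real network each would be an affine map followed by a nonlinearity):

```python
# A sketch of f = A_5 ∘ A_4 ∘ A_3 ∘ A_2 for a 5-layer network:
# one transformation A_k per pair of adjacent layers, so 4 in total.
def compose(*fs):
    # compose(A5, A4, A3, A2)(x) == A5(A4(A3(A2(x))))
    def composed(x):
        for g in reversed(fs):
            x = g(x)
        return x
    return composed

A2 = lambda x: 2 * x   # layer 1 -> layer 2 (placeholder)
A3 = lambda x: x + 1   # layer 2 -> layer 3 (placeholder)
A4 = lambda x: 3 * x   # layer 3 -> layer 4 (placeholder)
A5 = lambda x: x - 5   # layer 4 -> layer 5 (placeholder)

f = compose(A5, A4, A3, A2)
print(f(1.0))  # A5(A4(A3(A2(1.0)))) = 4.0
```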
Absolute value, because bigger errors are quadratically worse, it was tried and it worked better, or tradition?
I genuinely don’t know. I’ve wondered forever why squaring is so popular. It’s not just in ML, but everywhere.
My best guess is that it's in some fundamental sense more natural. Suppose you want to guess a location on a map. In that case, the obvious error would be the straight-line distance between your guess and the target. If your guess is (x, y) and the correct location is (x∗, y∗), then the distance is √((x−x∗)² + (y−y∗)²) – that's just how distances are computed in 2-dimensional space. (Draw a triangle between both points and use the Pythagorean theorem.) Now there's a square root, but the square root doesn't matter for the purposes of minimization – the square root is minimal if and only if the thing under the root is minimal, so you might as well minimize (x−x∗)² + (y−y∗)². The same is true in 3-dimensional space or n-dimensional space. So if general distance in abstract vector spaces works like the straight-line distance does in geometric space, then squared error is the way to go.
Also, thanks :)
One reason for using squared errors, which may be good or bad depending on the context, is that it's usually easier to Do Mathematics on it.
Squared error has been used instead of absolute error in many diverse optimization problems in part because its derivative is proportional to the magnitude of the error, whereas the derivative of the absolute error is constant. When you're trying to solve a smooth optimization problem with gradient methods, you generally benefit from loss functions with a smooth gradient that tends towards zero along with the error.
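A minimal sketch of that difference, writing the error as e and comparing the derivative of e² with the derivative of |e|:

```python
# The gradient of the squared error scales with the error itself,
# while the gradient of the absolute error has constant magnitude 1.
def grad_squared(e):
    # d/de of e^2
    return 2 * e

def grad_absolute(e):
    # d/de of |e| (undefined at 0; use 0 there by convention)
    return 0.0 if e == 0 else (1.0 if e > 0 else -1.0)

for e in [4.0, 1.0, 0.1, 0.001]:
    print(e, grad_squared(e), grad_absolute(e))
# As the error shrinks, the squared-error gradient tends to zero,
# giving gradient descent smaller and smaller steps near the optimum.
```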
Another possible reason for using squared error is that, from a stats perspective, the Bayes (optimal) estimator under squared-error loss is the mean of the distribution, i.e. the constant c that minimizes E[(X−c)²], whereas the Bayes estimator under absolute error (MAE) is the median. It's not clear to me that the mean is what you want, but maybe?
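A quick numerical check of this claim on a made-up sample, using a simple grid search for the constant that minimizes each loss:

```python
# The constant that minimizes the mean squared deviation is the mean;
# the constant that minimizes the mean absolute deviation is a median.
data = [1.0, 2.0, 2.5, 4.0, 10.0]

def mean_sq_dev(c):
    return sum((x - c) ** 2 for x in data) / len(data)

def mean_abs_dev(c):
    return sum(abs(x - c) for x in data) / len(data)

candidates = [i / 100 for i in range(0, 1101)]  # grid over [0, 11]
print(min(candidates, key=mean_sq_dev))   # 3.9, the sample mean
print(min(candidates, key=mean_abs_dev))  # 2.5, the sample median
```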