In my life I have never seen a good one-paragraph explanation of backpropagation, so I wrote one.
The most natural algorithms for calculating derivatives work by going through the expression syntax tree[1]. The tree has two ends, and starting the algorithm from each end gives one of two good derivative algorithms: forward propagation (starting from the input variables) and backward propagation (starting from the output variables). In both algorithms, calculating the derivative of one output variable y_1 with respect to one input variable x_1 creates a lot of intermediate artifacts. In forward propagation, these artifacts mean you get ∂y_n/∂x_1 for every output y_n for ~free; in backward propagation, you get ∂y_1/∂x_n for every input x_n for ~free. Backpropagation is used in machine learning because there is usually only one output variable (the loss, a number representing the difference between the model's prediction and reality) but a lot of input variables (the parameters, on the scale of millions to billions).
This blogpost has the clearest explanation. Credits for the image too.

[1] Or maybe a directed acyclic graph, for multivariable vector-valued functions like f(x, y) = (2x + y, y - x).
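To make the two directions concrete, here is a minimal sketch of both modes on a tiny expression tree. Everything in it (the example function y = x1*x2 + sin(x1), the Node class, and the helper names) is made up for illustration and is not taken from the linked blogpost; it is a toy, not how real frameworks implement backpropagation.

```python
import math

# One node of the expression tree: a value, links to its input nodes, and the
# local derivatives d(node)/d(input) needed for the chain rule.
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents
        self.local_grads = local_grads
        self.grad = 0.0  # filled in by reverse mode

def var(x):
    return Node(x)

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def sin(a):
    return Node(math.sin(a.value), (a,), (math.cos(a.value),))

# Forward propagation: push the seed dx1/dx1 = 1 from one input through the
# tree; the derivative of every intermediate and output node with respect to
# x1 comes out along the way for ~free.
def forward_mode(x1, x2):
    dx1, dx2 = 1.0, 0.0                       # differentiating w.r.t. x1
    a, da = x1 * x2, dx1 * x2 + x1 * dx2      # product rule
    b, db = math.sin(x1), math.cos(x1) * dx1  # chain rule
    y, dy = a + b, da + db
    return y, dy                              # dy = ∂y/∂x1

# Backward propagation: build the tree once, then push the seed dy/dy = 1 from
# the output back toward the inputs; the derivative of y with respect to every
# input comes out along the way for ~free. (This simple traversal assumes a
# tree in which only leaves are shared; a general DAG needs a topological order.)
def reverse_mode(x1_val, x2_val):
    x1, x2 = var(x1_val), var(x2_val)
    y = add(mul(x1, x2), sin(x1))             # y = x1*x2 + sin(x1)
    y.grad = 1.0
    stack = [y]
    while stack:
        node = stack.pop()
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # chain rule, accumulated
            stack.append(parent)
    return y.value, x1.grad, x2.grad          # y, ∂y/∂x1, ∂y/∂x2

print(forward_mode(1.5, 2.0))  # (y, ∂y/∂x1)
print(reverse_mode(1.5, 2.0))  # (y, ∂y/∂x1, ∂y/∂x2)
```

Both modes agree on ∂y/∂x1, and reverse mode additionally returns ∂y/∂x2 from the same single backward pass, which is exactly why it wins when there are millions of inputs and only one output.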
In the case of forward propagation, these artifacts means you get ∂y_i/∂x_1 for ~free, and in backwards propagation you get ∂y_1/∂x_i for ~free.
Presumably you meant to say something else here than to repeat ∂y_i/∂x_1 twice?
Edit: Oops, now I see it: the i is switched. I really did look quite carefully for any difference, but apparently I still wasn't good enough. This all makes sense now.
I could barely see that despite always using a zoom level of 150%. So I’m sometimes baffled at the default zoom levels of sites like LessWrong, wondering if everyone just has way better eyes than me. I can barely read anything at 100% zoom, and certainly not that tiny difference in the formulas!
Our post font is pretty big, but for many reasons it IMO makes sense for the comment font to be smaller. So that plus LaTeX is a bit of a dicey combination.
It is hard to see; I changed it to n.