Late to the party, but thanks for writing this up! I'm confused about two points in the calculation in the Theory section:
1. The FLOP needed to compute the term "δ3@A2R" (and similar). I understand this to be the outer product of two vectors: δ3 with length #output, and A2R with length #hidden2. If that's the case, shouldn't this require only #output*#hidden2*#batch FLOP (without the factor of two in the table), since it's just one multiplication per pair of numbers?
2. Whether the parameter updates need to be accumulated for each example in the batch. If so, would this mean one additional FLOP (an addition) per parameter per example in the batch? (I've sketched both counts in code just below.)
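To make the two counts concrete, here's a rough numpy sketch of what I have in mind (the layer sizes are made up for illustration, not taken from the post):

```python
import numpy as np

# Made-up sizes for illustration -- not the post's actual network.
n_output, n_hidden2, n_batch = 10, 128, 32

delta3 = np.random.randn(n_batch, n_output)  # per-example output deltas
A2 = np.random.randn(n_batch, n_hidden2)     # per-example hidden activations

# Point 1: each per-example outer product delta3[i] (x) A2[i] is one
# multiplication per weight and no additions, so across the batch:
flop_outer = n_output * n_hidden2 * n_batch  # no factor of two

# Point 2: accumulating the n_batch per-example gradients into a single
# weight gradient costs (n_batch - 1) additions per weight:
flop_accum = n_output * n_hidden2 * (n_batch - 1)

# The fused batched form computes both steps at once:
grad_W = delta3.T @ A2  # shape (n_output, n_hidden2)
```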
I think these two points cancel out, so the total still comes to the expected 2:1 ratio. They also seem consistent with the explanation here: https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4
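As a quick sanity check of the cancellation (again with made-up sizes), the multiplications from point 1 plus the accumulation additions from point 2 come out to essentially the standard 2 * #params * #batch count:

```python
n_output, n_hidden2, n_batch = 10, 128, 32

flop_multiplies = n_output * n_hidden2 * n_batch    # point 1: outer products
flop_adds = n_output * n_hidden2 * (n_batch - 1)    # point 2: batch accumulation
flop_standard = 2 * n_output * n_hidden2 * n_batch  # one multiply-add per weight per example

print(flop_multiplies + flop_adds)  # 80640
print(flop_standard)                # 81920 (equal up to batch vs. batch - 1)
```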