NNs are a big data approach, tuned by gradient descent. Because the data is big, every update is necessarily small (in the mathematical sense of a first-order approximation). When updates are this small, averaging them works fine, especially considering that most neural networks use sigmoid activation functions.
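As a minimal sketch of that point (the toy least-squares model, the data, and all names below are illustrative assumptions, not anything from the original text): when the learning rate is small, the cumulative effect of many per-sample gradient-descent updates is nearly the same as a single step taken on the averaged gradient, which is exactly why averaging is harmless in this regime.

```python
import numpy as np

# Toy "big data" regression problem (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

def grad(w, x, t):
    """Gradient of the per-sample squared error 0.5 * (x.w - t)^2."""
    return (x @ w - t) * x

lr = 1e-6  # small step => first-order regime

# Per-sample SGD: one tiny update per data point.
w_sgd = np.zeros(3)
for x, t in zip(X, y):
    w_sgd -= lr * grad(w_sgd, x, t)

# "Averaged" view: one step of the same total size on the mean gradient.
w_avg = np.zeros(3)
mean_grad = np.mean([grad(w_avg, x, t) for x, t in zip(X, y)], axis=0)
w_avg -= lr * len(X) * mean_grad

print("per-sample SGD :", w_sgd)
print("averaged step  :", w_avg)  # nearly identical when lr is small
```

The small learning rate is the whole trick here: each update barely moves the parameters, so to first order the individual gradients can be summed, or equivalently averaged, without changing the outcome.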
While this averaging approach cannot solve small data problems, it is perfectly suited to today’s NN applications, where the data tends to be well-contained and without fat tails. It works fine within the traditional problem domain of neural networks.