Doesn't this depend on whether it can achieve perfect predictive power? What I had in mind was something like autoregressive text prediction, where there will always be some prediction error. I would've assumed those prediction errors constantly introduce some noise into the gradients?
Ah, yeah, you're right. Thanks, I had been misunderstanding why SGD converges to a local minimum. (Convergence depends on a steadily decreasing step size η; that decay is doing more of the work than I realized.)
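A minimal toy sketch of the point (my own example, not from the parent comments; the quadratic objective and noise level are arbitrary assumptions): with persistent gradient noise and a constant η, the iterate keeps bouncing around the minimum, whereas a decaying schedule like η_t = 1/t lets it settle down.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(eta_schedule, steps=5000, noise_std=1.0):
    """Minimize f(w) = w^2 / 2 using gradients corrupted by persistent noise."""
    w = 5.0
    for t in range(1, steps + 1):
        grad = w + noise_std * rng.normal()  # true gradient + irreducible noise
        w -= eta_schedule(t) * grad
    return w

print(run_sgd(lambda t: 0.1))      # constant eta: fluctuates around 0, never settles
print(run_sgd(lambda t: 1.0 / t))  # decaying eta: drifts toward 0 as the noise averages out
```

With the constant step size the stationary fluctuations scale roughly with sqrt(η), so the noise never washes out; the 1/t schedule satisfies the usual Robbins-Monro conditions and converges despite the noise.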
Both terms shrink near a local minimum.