Hmm, and the -E[u(X, θ)] term would shrink during training, right? So eventually the dynamics would be dominated by the noise term rather than the drift? This makes me think of the “grokking” concept.
Both terms shrink near a local minimum.
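To spell out the picture I have in mind (a rough sketch, and I'm assuming u(X, θ) here is the per-sample gradient, so that -E[u(X, θ)] is exactly the drift), one common continuous-time approximation of SGD is

$$
d\theta_t \;=\; \underbrace{-\,\mathbb{E}\big[u(X,\theta_t)\big]}_{\text{drift}\;=\;-\nabla L(\theta_t)}\,dt \;+\; \underbrace{\sqrt{\eta}\,\Sigma(\theta_t)^{1/2}}_{\text{diffusion}}\,dW_t,
\qquad
\Sigma(\theta) \;=\; \operatorname{Cov}_X\!\big[u(X,\theta)\big].
$$

At an interpolating minimum every per-sample gradient u(X, θ*) is zero, so ∇L(θ*) and Σ(θ*) both vanish: the drift and the diffusion shrink together.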
This depends on whether the model can achieve perfect predictive power or not, no? What I had in mind was something like autoregressive text prediction, where there will always be some prediction errors. I would’ve assumed those prediction errors constantly introduce some noise into the gradients?
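For example, here's a toy sketch of what I mean (purely illustrative, the setup and names are mine): 1-D linear regression with irreducible label noise, standing in for the unavoidable error in next-token prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (mine): y = w_true * x + irreducible noise, squared loss 0.5*(w*x - y)^2.
w_true, sigma = 2.0, 0.5
x = rng.normal(size=100_000)
y = w_true * x + sigma * rng.normal(size=x.size)

# Per-sample gradients evaluated exactly at the optimum w = w_true.
per_sample_grad = (w_true * x - y) * x   # equals -noise * x at the optimum

print("mean gradient:", per_sample_grad.mean())  # ~ 0: the drift vanishes
print("gradient std :", per_sample_grad.std())   # ~ sigma: the noise does not
```

Even exactly at the minimizer the per-sample gradients don't vanish, so the gradient covariance stays bounded away from zero.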
Ah, yeah, you’re right. Thanks, I had been misunderstanding why SGD converges to a local minimum at all. (Convergence depends on a steadily decreasing η; that decay is doing more work than I realized.)
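To check my corrected understanding, here's a quick sketch on a similar noisy toy regression (my own illustration, nothing canonical): with a constant η the iterates keep bouncing around the minimum at a noise floor, while a Robbins-Monro style schedule (Σ η_t = ∞, Σ η_t² < ∞, e.g. η_t = η₀ / (1 + c·t)) keeps closing in on it.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true, sigma, eta0, steps = 2.0, 0.5, 0.1, 50_000

def sgd(schedule):
    # SGD on 0.5*(w*x - y)^2 with irreducible label noise; returns the average
    # |w - w_true| over the last 10% of steps, a rough proxy for where the
    # iterates settle.
    w, tail = 0.0, []
    for t in range(steps):
        x = rng.normal()
        y = w_true * x + sigma * rng.normal()
        w -= schedule(t) * (w * x - y) * x
        if t >= int(0.9 * steps):
            tail.append(abs(w - w_true))
    return sum(tail) / len(tail)

# Constant step: the iterates keep bouncing around the minimum at a noise floor.
print("constant eta:", sgd(lambda t: eta0))
# Robbins-Monro style decay (sum of eta_t diverges, sum of eta_t^2 converges):
# the iterates keep closing in on w_true instead of plateauing.
print("decaying eta:", sgd(lambda t: eta0 / (1 + 0.01 * t)))
```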