In SGD, our “intended” drift is −E_X[∇θ u(X,θ)] - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias”—a second drift term, resulting from drift down the gradient of the noise. Combining the equations from the previous two sections, this noise-gradient drift is
−(η/n) ∇θ⋅Var_X[∇θ u(X,θ)]
I have not followed all your reasoning, but focusing on this last formula, does it represent a bias towards less variance over the different gradients one can sample at a given point?
If so, then I do find this quite interesting. A random connection: I actually thought of one strong form of gradient hacking as forcing the gradient to be (approximately) the same for all samples. Your result seems to imply that if such forms of gradient hacking are indeed possible, then they might be incentivized by SGD.
Yup, exactly.
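To see the effect concretely, here is a toy simulation (my own construction, not from the post): the per-sample gradient is g(x,θ) = x·θ with x ~ N(0,1), so the expected gradient is zero everywhere (no intended drift) while the gradient variance is θ². The formula above then predicts a drift of −(η/n)·d/dθ(θ²) = −(2η/n)·θ, i.e. a push toward θ = 0, the point of minimum gradient variance — which is what the simulated SGD trajectories do.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, n, steps, runs = 0.1, 4, 2000, 500

# Per-sample gradient g(x, theta) = x * theta with x ~ N(0, 1):
#   E_X[g] = 0        -> no "intended" drift anywhere
#   Var_X[g] = theta^2 -> gradient variance is minimized at theta = 0
# Predicted noise-gradient drift: -(eta/n) * d/dtheta Var = -(2*eta/n) * theta.
theta = np.ones(runs)  # many independent trajectories, all starting at theta = 1
for _ in range(steps):
    x = rng.standard_normal((runs, n)).mean(axis=1)  # minibatch-averaged noise
    theta -= eta * x * theta                         # SGD step on a zero-mean gradient

# The typical trajectory contracts toward theta = 0: the median |theta|
# ends far below its starting value of 1.0.
print(np.median(np.abs(theta)))
```

This is the "bias towards less variance" in miniature: there is no objective-gradient signal at all, yet minibatch SGD still drifts the parameters toward the region where sampled gradients agree with each other.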