Regarding “GD++”: this is almost identical to the dynamics you’d expect when doing gradient descent on linear regression. See p. 10 of these lecture notes for an explanation.
Granted, here they’re applying this linear transformation to the input data rather than as an operator on the weights, but my intuition says there’s got to be some sort of connection here: it’s “removing” (part of) the component of $x$ that can be represented as a linear combination of the data. (Apologies for a half-formed response; happy to hear any connections others make.)
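To make the intuition concrete, here is a minimal sketch, assuming the transformation has the form $x \leftarrow (I - \gamma X^\top X)\,x$ with one data point per row of $X$; that form is my reading and is not checked against the paper. It shows two things: for gradient descent on $\tfrac{1}{2}\lVert Xw - y\rVert^2$, the weight error $w - w^*$ is multiplied by that same operator at each step, and applying the operator to an input $x$ shrinks the part of $x$ lying in the span of the data while leaving the orthogonal part untouched. The toy data, the step size `gamma`, and all helper names are made up for illustration.

```python
# Illustrative sketch only: toy data, step size, and helper names are hypothetical.

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def gd_operator(X, gamma):
    """Return T = I - gamma * X^T X, where X has one data point per row."""
    cols = transpose(X)
    d = len(cols)
    gram = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(d)]
            for i in range(d)]
    return [[(1.0 if i == j else 0.0) - gamma * gram[i][j] for j in range(d)]
            for i in range(d)]

# Two data points in R^3; both lie in the plane spanned by the first two axes.
X = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0]]
w_star = [2.0, -1.0, 0.0]
y = matvec(X, w_star)  # noiseless targets, so w_star is a minimizer

# trace(X^T X) upper-bounds the largest eigenvalue, so this step size is stable.
gamma = 1.0 / sum(v * v for row in X for v in row)
T = gd_operator(X, gamma)

# (1) One gradient descent step on 0.5 * ||Xw - y||^2, starting from w = 0.
w = [0.0, 0.0, 0.0]
residual = [p - t for p, t in zip(matvec(X, w), y)]
grad = matvec(transpose(X), residual)
w_next = [wi - gamma * gi for wi, gi in zip(w, grad)]

# The error after the step equals T applied to the error before the step.
err_next = [a - b for a, b in zip(w_next, w_star)]
err_pred = matvec(T, [a - b for a, b in zip(w, w_star)])

# (2) The same operator applied to an input vector instead of a weight error.
x = [1.0, 1.0, 1.0]
x_new = matvec(T, x)

print("error after GD step:", err_next)
print("T @ (w - w_star):   ", err_pred)
print("x before:", x, "x after:", x_new)
```

In this toy case the third coordinate of `x` is orthogonal to every data point and passes through unchanged, while the first two coordinates (the in-span part) get smaller. That matches the "removing part of the component" picture, though it obviously doesn't establish a deeper connection on its own.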
(Edited to fix link formatting.)