Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.
I expect the result to be directionally similar but less pronounced (because MCMC methods with momentum explore the distribution better).
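To make the suggestion concrete, here is a minimal sketch of what such a simulation could look like. Everything in it is an assumption for illustration rather than the post's actual setup: the toy potential U(x, y) = 0.5 x² y², the Gaussian noise standing in for minibatch noise, the heavy-ball form of momentum, and the names `grad_potential`, `sgd_momentum`, and `fraction_near_origin`.

```python
import numpy as np


def grad_potential(w):
    # Hypothetical toy potential U(x, y) = 0.5 * x^2 * y^2. Its zero set is the
    # two coordinate axes, which cross at a singular point at the origin.
    x, y = w
    return np.array([x * y**2, x**2 * y])


def sgd_momentum(w0, steps=5_000, lr=1e-2, beta=0.9, noise=0.1, seed=0):
    """Noisy gradient descent with heavy-ball momentum; returns the full path."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    path = np.empty((steps + 1, w.size))
    path[0] = w
    for t in range(steps):
        # Isotropic Gaussian noise is an assumed stand-in for minibatch noise.
        g = grad_potential(w) + noise * rng.standard_normal(w.size)
        v = beta * v - lr * g  # velocity keeps a decaying memory of past gradients
        w = w + v
        path[t + 1] = w
    return path


def fraction_near_origin(path, radius=0.1):
    # Share of visited points within `radius` of the singular point at the origin.
    return float(np.mean(np.linalg.norm(path, axis=1) < radius))


if __name__ == "__main__":
    path = sgd_momentum([1.0, 1.0])
    print(fraction_near_origin(path))
```

Setting `beta=0` recovers plain noisy gradient descent, so the same function can be used to compare the two cases on `fraction_near_origin`.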
I also take issue with the way the conclusion is phrased. “Singularities work because they transform random motion into useful search for generalization”. This is only true if you assume that points nearer a singularity generalize better. Maybe I’d phrase it as, “SGD works because it’s more likely to end up near a singularity than the potential alone would predict, and singularities generalize better (see my [Jesse’s] other post)”. Would you agree with this phrasing?
Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.
Hey I need a reason to write a follow-up to this, right?
I also take issue with the way the conclusion is phrased. “Singularities work because they transform random motion into useful search for generalization”. This is only true if you assume that points nearer a singularity generalize better. Maybe I’d phrase it as, “SGD works because it’s more likely to end up near a singularity than the potential alone would predict, and singularities generalize better (see my [Jesse’s] other post)”. Would you agree with this phrasing?
I was trying to be intentionally provocative, but you’re right — it’s too much. Thanks for the suggestion!