How much slower is E. coli-style optimization compared to gradient descent? What’s the cost of experimenting with random directions rather than moving in the “best” direction?
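For concreteness, here’s a minimal sketch of the two update rules being compared. The quadratic loss, step sizes, and accept-only-if-better rule are illustrative assumptions, not anyone’s actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # Hypothetical objective: a simple quadratic bowl.
    return 0.5 * np.dot(x, x)

def grad(x):
    # Exact gradient of the quadratic above.
    return x

def gradient_step(x, lr=0.1):
    # "Best" direction: move along the exact gradient.
    return x - lr * grad(x)

def ecoli_step(x, step_size=0.1):
    # Random direction: try a random unit vector, keep the move only if it
    # lowers the loss (run-and-tumble style; this particular accept rule is
    # just one illustrative choice).
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)
    candidate = x + step_size * d
    return candidate if loss(candidate) < loss(x) else x

x = rng.standard_normal(50)
print(loss(x), loss(gradient_step(x)), loss(ecoli_step(x)))
```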
There was a post a while back claiming that evolutionary optimization is somehow equivalent to SGD, and I was going to respond: no, that can’t be. Evolution steps in mostly random directions, so at best it’s equivalent to a random forward-gradient method, which has completely different (worse) asymptotic convergence with respect to parameter dimension, as you discuss. There’s a reason SGD methods end up using large batches and momentum to smooth out gradient noise before stepping.
I do still expect that evolutionary optimization is basically similar to SGD in terms of what kinds of optima they find, and more generally in terms of what their trajectories look like at a coarse-grained scale. But algorithmically, yeah, SGD should follow that trajectory a lot faster, especially as dimensionality goes up.
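As a rough illustration of that dimension dependence (a toy experiment, not a careful benchmark; the quadratic loss, step size, and stopping threshold are all assumptions), compare exact gradient steps against steps along a random unit direction scaled by the directional derivative, i.e. a forward-gradient-style update:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    return 0.5 * np.dot(x, x)  # simple quadratic, chosen for illustration

def run_gd(dim, lr=0.2, tol=1e-3, max_iters=100_000):
    # Exact gradient descent; gradient of the quadratic is x itself.
    x = np.ones(dim)
    for t in range(max_iters):
        if loss(x) < tol:
            return t
        x -= lr * x
    return max_iters

def run_random_direction(dim, lr=0.2, tol=1e-3, max_iters=1_000_000):
    # Forward-gradient-style step: measure the directional derivative along a
    # random unit vector and step along that vector only.
    x = np.ones(dim)
    for t in range(max_iters):
        if loss(x) < tol:
            return t
        d = rng.standard_normal(dim)
        d /= np.linalg.norm(d)
        x -= lr * np.dot(x, d) * d
    return max_iters

# Iterations needed to reach the loss threshold as dimension grows.
for dim in (10, 100, 1000):
    print(dim, run_gd(dim), run_random_direction(dim))
```

On this quadratic the random-direction version needs on the order of d times more iterations, since a random unit vector captures only about 1/d of the gradient’s squared norm in expectation.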