“Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are [broad and robust]”
I understand why you think that broad is true. But I’m not sure I get robust. In fact, robust seems to make intuitive dis-sense to me. Your examples are gradient descent and evolution, neither of which has memory, so how would they be able to know how “robust” an optimum is? Part of me thinks the idea comes from the fact that if a system optimized for a non-robust optimum, it wouldn’t internally be doing anything different, but we would probably say it failed to optimize, so it looks like successful optimizers optimize for robust optima. Plus, broad optima are more likely to be robust. I’m not sure, but I do notice my confusion at the inclusion of “robust”. My current intuition is something like: “Broadness and robustness of optima are tightly coupled. But, given that, optimization for robust optima only happens insofar as it is really optimization for broad optima. Optimization for robust but not broad optima does not happen, and optimization for statically broad (equally broad in the unchanging landscape) but more robust optima does not happen any better.”
If we’re just optimizing some function, then indeed breadth is the only relevant part. But for something like evolution or SGD, we’re optimizing over random samples, and it’s the use of many different random samples which I’d expect to select for robustness.
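(To make “optimizing over random samples” concrete, here is a minimal sketch. The landscape, the noise scale, and all the numbers are my own toy construction, not anything from the original post: two equally deep wells, one narrow and one broad, with every gradient evaluated at a randomly jittered point.)

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # Two equally deep wells -- narrow at x = -1, broad at x = +2 --
    # plus a gentle quadratic so nothing is perfectly flat.
    return (-np.exp(-(x + 1.0) ** 2 / 0.01)
            - np.exp(-(x - 2.0) ** 2 / 1.0)
            + 0.01 * x ** 2)

def grad(x, h=1e-5):
    # Numerical gradient; good enough for a 1-D toy.
    return (loss(x + h) - loss(x - h)) / (2 * h)

def noisy_sgd(x0, steps=20_000, lr=0.05, noise=0.3):
    # Each gradient is taken at a randomly perturbed point, standing in
    # for "optimizing over random samples" of the environment.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x + rng.normal(0.0, noise))
    return x

# Start one run at the bottom of each well. The narrow well typically
# cannot hold the iterate under this much sampling noise, so both runs
# tend to finish near the broad well at x = 2.
print(round(noisy_sgd(-1.0), 2), round(noisy_sgd(2.0), 2))
```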
Maybe I misunderstand your use of robust, but this still seems to me to be breadth: if an optimum is broader, samples are more likely to fall within it. I took broad to mean “has a lot of (hyper)volume in the optimization space”, and robust to mean “stable over time/perturbation”. I still contend that those optimization processes are unaware of time, or of any environmental variation, and can only select for it insofar as it is expressed as breadth.
The example I have in my head is this: if you had an environment, and committed to changing some aspect of it after some period of time, evolution or SGD would optimize exactly the same as if you had committed to a different change. Which change you commit to affects the robustness of the environment’s optima, but the state of the environment alone determines their breadth. The processes cannot optimize based on your committed change before it happens, so they cannot optimize for robustness.
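(A toy version of this commitment experiment; the setup and all numbers are hypothetical, mine rather than the thread’s. Run SGD against a moving target, commit to two different changes after phase one, and check that the trajectories agree until the change actually lands.)

```python
import numpy as np

def run(targets, seed=1, steps_per_phase=50, lr=0.1, noise=0.05):
    # SGD on f(x) = (x - target)^2, where `targets` lists the target for
    # each phase of the committed environmental change.
    rng = np.random.default_rng(seed)
    x, path = 0.0, []
    for target in targets:
        for _ in range(steps_per_phase):
            sample = target + rng.normal(0.0, noise)  # noisy glimpse of the env
            x -= lr * 2.0 * (x - sample)              # gradient of (x - sample)^2
            path.append(x)
    return np.array(path)

path_small_shift = run([1.0, 1.2])   # commit to a small change after phase one
path_large_shift = run([1.0, -5.0])  # commit to a drastic change instead

# Identical up to the moment the committed change lands: the process has
# no access to the future, so the coming change cannot shape what it finds.
print(np.allclose(path_small_shift[:50], path_large_shift[:50]))  # True
```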
Given what you said about random samples, I think you might be working under definitions along the lines of “robust optima are ones that work in a range of environments, so you can be put in a variety of random circumstances and still have them work” and (at this point I struggled a bit to figure out what a “broad” optimum would be that’s different, and this is what I came up with) “broad optima are those that you can do approximately and still get a significant chunk of the benefit.” I feel like these can still be unified into one thing, because I think approximate strategies in fixed environments are similar to fixed strategies in approximate environments? Like moving a little to the left is similar to the environment being a little to the right?
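(For what it’s worth, the proposed unification is exact when the environmental change is a pure translation. A minimal numeric check, where the landscape and numbers are my own arbitrary choices:)

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # An arbitrary fixed landscape, chosen only for the numerics.
    return np.sin(3.0 * x) + 0.5 * x ** 2

x0 = 0.7                                  # the strategy under evaluation
eps = rng.normal(0.0, 0.1, size=10_000)   # perturbations

# "Approximate strategy, fixed environment": jitter the action.
approx_strategy = f(x0 + eps).mean()

# "Fixed strategy, approximate environment": relocate the whole
# landscape by e (f_e(x) = f(x + e)) and play the unperturbed action.
def shifted(e):
    return lambda x: f(x + e)

approx_environment = np.mean([shifted(e)(x0) for e in eps])

print(approx_strategy, approx_environment)  # effectively identical
```

(The two averages coincide by a simple change of variables, which is exactly the “moving a little to the left” case; environment changes that are not translations are where breadth and robustness could genuinely come apart.)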