I know of one called Nevergrad, but I think it’s designed more for things like hyperparameter optimization with ~dozens of variables or fewer, rather than hundreds/thousands of network weights in complex, multi-objective optimization problems.
Nevergrad is more of a library than a single specific algorithm: it bundles many black-box optimizers (CMA-ES, differential evolution, PSO, and so on).
Offhand, several of its algorithms should work on NNs (Ha is particularly fond of CMA-ES), and several are multi-objective; they highlight PDE and DEMO. But you can also just define a fitness function which collapses your objectives into a single index with a weighted sum, or bypass the issue entirely by using novelty-search-style approaches to build up a big library of diverse agents, and only then start doing any curation/selection/optimization.
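To make the weighted-sum bit concrete, here’s a minimal Nevergrad sketch with CMA-ES; the vector size, the two toy objectives, and the 0.1 weighting are all placeholder assumptions, not recommendations:

```python
import numpy as np
import nevergrad as ng

N_WEIGHTS = 200  # hypothetical size of the flattened weight vector

def fitness(w: np.ndarray) -> float:
    # Placeholder objectives standing in for your real ones:
    task_loss = float(np.sum(w ** 2))     # objective 1 (pretend task loss)
    sparsity = float(np.mean(np.abs(w)))  # objective 2 (pretend sparsity penalty)
    return task_loss + 0.1 * sparsity     # weighted-sum scalarization

opt = ng.optimizers.CMA(parametrization=ng.p.Array(shape=(N_WEIGHTS,)), budget=5000)
recommendation = opt.minimize(fitness)
best_weights = recommendation.value      # best weight vector found
```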
As Pearce says, you should go skim all of David Ha’s papers, as he’s fiddled around a lot with very small networks with unusual activations, topologies, etc. For example, his weight-agnostic NNs, where the evolved irregular connectivity encodes the algorithm, so that even randomized weights still compute the right answer.
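(A sketch of what “weight-agnostic” cashes out to in practice: you score a topology across several *shared* weight values, so a good score means the connectivity, not any tuned weights, is doing the computation. `forward_fn` is a hypothetical closure, not anything from the paper’s code.)

```python
import numpy as np

def evaluate_topology(forward_fn, shared_weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    # forward_fn(w): hypothetical; runs the fixed topology with every
    # connection weight set to the single shared value w, returns reward.
    return float(np.mean([forward_fn(w) for w in shared_weights]))
```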
He has a new library up, EvoJax, for highly-optimized evolutionary algorithms on TPUs, which might be useful. (If you need TPU resources, the TPU Research Cloud is apparently still around and claiming to have plenty of TPUs.)
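I won’t vouch for EvoJax’s exact API from memory, but the core trick it exploits is easy to show in plain JAX: `vmap` evaluates an entire population in parallel, here inside a toy OpenAI-style evolution strategy (the objective and constants are arbitrary stand-ins):

```python
import jax
import jax.numpy as jnp

def fitness(w):
    return -jnp.sum(w ** 2)  # toy objective: maximize -> shrink the weights

@jax.jit
def es_step(key, mean, sigma=0.1, pop_size=256):
    key, sub = jax.random.split(key)
    noise = jax.random.normal(sub, (pop_size,) + mean.shape)
    pop = mean + sigma * noise                        # sample a population
    scores = jax.vmap(fitness)(pop)                   # evaluate all at once
    advantages = (scores - scores.mean()) / (scores.std() + 1e-8)
    grad_est = (advantages[:, None] * noise).mean(0)  # ES gradient estimate
    return key, mean + 0.05 * sigma * grad_est        # simple ES update

key = jax.random.PRNGKey(0)
mean = jnp.zeros(100)
for _ in range(200):
    key, mean = es_step(key, mean)
```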
For your purposes, I think NEAT/HyperNEAT would be worth looking at: such approaches would let you evolve the topology and also use a variety of activation functions and whatever other changes you wanted to experiment with. I agree with Pearce here: I’m a bit dubious about hand-engineering such a fancy activation. (It may work, but does it really give you more interpretability or other important properties?)
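For instance, with the neat-python package a run looks roughly like this; the config filename and the toy evaluation are my own stand-ins (the activation-function choices live in that config file):

```python
import neat  # neat-python

def eval_genomes(genomes, config):
    # Assign a fitness to every genome; the task below is a stand-in.
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        output = net.activate([1.0, 0.0])        # toy input
        genome.fitness = -abs(output[0] - 1.0)   # toy target

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     'neat-config.ini')  # hypothetical config file
pop = neat.Population(config)
winner = pop.run(eval_genomes, 100)  # evolve for up to 100 generations
```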
You can also combine evolution with gradients*, and population+novelty search would be of interest. Population search can help with hyperparameter search as well, and would go well with some big runs on TPUs using EvoJax.
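Novelty search itself is basically a one-function idea: score each agent by how far its behavior is from the k nearest behaviors already seen, rather than by task performance. A sketch, with the behavior characterization left as a task-specific assumption:

```python
import numpy as np

def novelty(behavior, archive, k=15):
    # behavior/archive: task-specific behavior-descriptor vectors (your call).
    # Reward being *different* from what's been seen, not being good.
    if len(archive) == 0:
        return float('inf')
    dists = np.linalg.norm(np.asarray(archive) - behavior, axis=1)
    return float(np.sort(dists)[:k].mean())  # mean distance to k nearest
```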
Pruning NNs is an old idea dating back to the 1980s, so there’s a deep literature on it with a lot of ideas to try.
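The simplest baseline from that literature, global magnitude pruning, fits in a few lines (a sketch of the generic idea, not any one paper’s method):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    # Zero out the smallest-magnitude weights until ~`sparsity` of them are gone.
    k = min(int(sparsity * w.size), w.size - 1)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= threshold, w, 0.0)
```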
This is an interesting idea. I actually do the opposite: cut the regularization intensity over time. I end up re-running training many times until I get a good initialization instead.
That sounds like you’re kinda cheating/bruteforcing a bad strategy… I’d be surprised if having very high regularization at the beginning and then removing it turned out to be optimal. In general, the idea that you should explore and only then exploit is a commonplace in reinforcement learning and in evolutionary computing in particular—that’s one of Stanley & Lehman’s big ideas.
“Large” is kinda relative in this situation. Given the kinds of weights I see after training, I consider anything >1 to be “large”.
Why not use binary weights, then? (You can even backprop them in several ways.)
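The best-known of those ways is the straight-through estimator: binarize on the forward pass, pretend it was the identity on the backward pass. A minimal JAX sketch:

```python
import jax
import jax.numpy as jnp

def binarize_ste(w):
    # Forward: sign(w); backward: gradient passes through as if identity.
    b = jnp.sign(w)
    return w + jax.lax.stop_gradient(b - w)

# Gradients flow as if binarize_ste were the identity:
grad = jax.grad(lambda w: jnp.sum(binarize_ste(w) ** 2))(jnp.array([0.3, -1.2]))
```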
* I don’t suggest actually trying to do this—either go evolution or go backprop—I’m just linking it because I think surrogate gradients are neat.
Wow, I appreciate this list! I’ve heard of a few of the things you list, like the weight-agnostic NNs, but most of it is entirely new to me.
Tyvm for taking the time to put it together.