Thanks for your detailed notes!
have you considered dropping backprop entirely and using blackbox methods like evolutionary computing
This is a really neat idea that I’d love to explore more. I’ve tried some brief experiments in that area in the past, using Z3 to search for valid parameter combinations for different logic gates built out of that custom activation function. I didn’t have any luck, though; the solver ran for hours without finding any solutions, and I fell back to a brute-force search instead.
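For reference, the encoding I was attempting looked roughly like the sketch below, except here I’ve swapped the custom activation for a plain 0/1 threshold and used an AND gate as the target just to keep the example self-contained—so treat it as illustrative rather than the actual code:

```python
# Rough sketch of the kind of Z3 encoding I was attempting.
# NOTE: the real experiment used the custom activation from the post; a simple
# threshold step is substituted here purely as a placeholder.
from z3 import Real, If, Solver, sat

def neuron(w1, w2, b, x1, x2):
    # Placeholder "activation": 1 if the pre-activation is positive, else 0.
    return If(w1 * x1 + w2 * x2 + b > 0, 1, 0)

w1, w2, b = Real("w1"), Real("w2"), Real("b")
s = Solver()

# Constrain a single neuron to behave like an AND gate on all four inputs.
for x1, x2, y in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    s.add(neuron(w1, w2, b, x1, x2) == y)

if s.check() == sat:
    print(s.model())  # prints one satisfying weight/bias assignment
```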
A big part of the issue for me is that I’m just very unfamiliar with the whole domain. There’s probably a lot I did wrong in that experiment that caused it to fail, but I find that there are far fewer resources for those kinds of tools and methods than for backprop and neural network-focused techniques.
I know you’ve done work on a huge variety of topics. Do you know of any particular black-box optimizers or techniques that might be a good starting point for further exploration of this space? I know of one called Nevergrad, but I think it’s designed more for stuff like hyperparameter optimization over a few dozen variables or fewer, rather than hundreds or thousands of network weights with complex, multi-objective optimization problems. I could be wrong though!
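For context, my (possibly wrong) mental model of how Nevergrad would be applied here is something like the sketch below; the flattened weight count and the toy objective are made-up placeholders, not anything from the actual post:

```python
# Minimal sketch of using Nevergrad to optimize a flat weight vector directly.
import numpy as np
import nevergrad as ng

N_WEIGHTS = 256  # hypothetical flattened parameter count

def loss(weights: np.ndarray) -> float:
    # Stand-in objective; in practice this would run the network on the
    # training set and return something like cross-entropy.
    return float(np.sum((weights - 0.5) ** 2))

param = ng.p.Array(shape=(N_WEIGHTS,))
optimizer = ng.optimizers.NGOpt(parametrization=param, budget=2000)
recommendation = optimizer.minimize(loss)
print(recommendation.value[:5])  # first few entries of the recommended weights
```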
Another thing to try is to avoid regularization early on
This is an interesting idea. I actually do the opposite: I cut the regularization intensity over time, and instead end up re-running training many times until I get a good initialization.
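Concretely, the schedule looks something like the toy sketch below (a stand-in model, an L1 penalty, and made-up constants—not the actual training code from the post):

```python
# Toy sketch of "start with strong regularization, decay it over training".
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(256, 2)
y = (x[:, 0] > 0.5).float().unsqueeze(1)  # toy target

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()

num_epochs, initial_reg = 200, 1e-2
for epoch in range(num_epochs):
    # Linearly anneal the regularization strength toward zero.
    reg_strength = initial_reg * (1.0 - epoch / num_epochs)
    opt.zero_grad()
    reg = sum(p.abs().sum() for p in model.parameters())  # L1 penalty as an example
    loss = criterion(model(x), y) + reg_strength * reg
    loss.backward()
    opt.step()
```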
I’m a little concerned about the need for large weights.
“Large” is kinda relative in this situation. Given the kinds of weights I see after training, I consider anything >1 to be “large”.
As for representing logic gates: almost all of them can be expressed with integer combinations of weights and biases, just due to the way the activation function works. If the models stuck to using only integer values, it would be possible to store the params very efficiently and turn the whole thing from continuous to discrete (which, to be honest, was the ultimate goal of this work).
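As a rough illustration of the integer-weights point (using an ordinary 0/1 threshold step as a stand-in for the custom activation):

```python
# AND and OR gates from a single "neuron" with integer weights and biases.
def step(z: float) -> int:
    return 1 if z > 0 else 0

def gate(w1: int, w2: int, b: int, x1: int, x2: int) -> int:
    return step(w1 * x1 + w2 * x2 + b)

# AND: weights (1, 1), bias -1.  OR: weights (1, 1), bias 0.  All integers.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", gate(1, 1, -1, x1, x2), "OR:", gate(1, 1, 0, x1, x2))
```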
However, I wasn’t able to get the models to do that outside of very simple examples. A lot of neurons do develop integer weights on their own, though.
the top of one figure is slightly cut off
Ty for letting me know! I’ll look into that.
Ty again for taking the time to read the post and for the detailed feedback—I truly appreciate it!
Wow, I appreciate this list! I’ve heard of a few of the things you mention, like the weight-agnostic NNs, but most of it is entirely new to me.
Tyvm for taking the time to put it together.