The purpose of my post is to show that they are elegant; even further, that if you tried to come up with the ideal approximation of SI from first principles, you would just end up with NNs.
Indeed, although SGD is probably not the optimal approximation of Bayesian inference: for example, it doesn't track uncertainty at all, though that is currently a hot area of research.
I only barely mentioned it in my post, but there are ways of approximating Bayesian inference, such as MCMC. In fact, there are MCMC methods that can take advantage of stochastic gradient information, which should make them roughly as efficient as SGD; see the sketch below.
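For concreteness, here is a minimal sketch of one such method, stochastic gradient Langevin dynamics (SGLD; Welling & Teh, 2011): it is essentially SGD on the log-posterior with Gaussian noise injected at each step, scaled so that the iterates approximately sample from the posterior rather than converging to a point estimate. The toy model (Bayesian linear regression with a Gaussian prior) and all constants here are illustrative assumptions, not taken from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a hypothetical regression problem: y = X @ w_true + noise.
N, d = 1000, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1
y = X @ w_true + sigma * rng.normal(size=N)

def grad_log_posterior(w, xb, yb):
    """Stochastic estimate of grad log p(w | data) from a minibatch."""
    grad_prior = -w                                # N(0, I) prior on the weights
    resid = yb - xb @ w
    # Minibatch likelihood gradient, rescaled by N/n to estimate the full sum.
    grad_lik = (N / len(yb)) * (xb.T @ resid) / sigma**2
    return grad_prior + grad_lik

w = np.zeros(d)
eps = 1e-5          # step size (annealed toward zero in the original algorithm)
batch = 32
samples = []
for t in range(5000):
    idx = rng.choice(N, batch, replace=False)
    g = grad_log_posterior(w, X[idx], y[idx])
    # SGLD update: half a gradient step plus Gaussian noise with variance eps.
    w = w + 0.5 * eps * g + np.sqrt(eps) * rng.normal(size=d)
    if t > 1000:    # discard burn-in
        samples.append(w.copy())

post = np.array(samples)
print("posterior mean:", post.mean(axis=0))  # should be close to w_true
print("posterior std :", post.std(axis=0))   # the uncertainty estimate plain SGD lacks
```

Note that the per-step cost is the same as a minibatch SGD step; the only additions are the prior gradient and the injected noise, which is what makes the "roughly as efficient as SGD" claim plausible.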
There is also a recent paper by DeepMind, "Weight Uncertainty in Neural Networks."