SGD is not an approximation of Bayesian inference. It has nothing to do with Bayesianism; it's just a general optimization algorithm that is useful for fitting models to data.
How well read are you in machine learning? 10s of papers? 100s? 1000s? PhD level? This and your other comment about IRL suggest that you have only cursory knowledge of the field. Also, 'Bayesianism' isn't a thing outside of LW.
Bayesian inference is also just an “algorithm which is useful for fitting models to data.”
Inference problems can be turned into optimization problems and vice versa. In particular, the single MLE estimate from a full exhaustive inference over some data set, conditioned on some observable, is exactly equivalent to a global optimization problem solved with exhaustive search.
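To make that concrete, here is a toy sketch (my own invented example in Python, not anyone's published method): the MLE over a discretized parameter grid is literally the same computation whether you phrase it as exhaustive inference (score every candidate, take the mode) or as exhaustive-search optimization (score every candidate, take the minimizer of the loss).

```python
# Toy illustration: exhaustive MLE "inference" over a grid of candidate
# means vs. exhaustive-search optimization of the negative log-likelihood.
# The data, grid, and model are invented purely for this example.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=100)   # observations
grid = np.linspace(-5.0, 5.0, 2001)               # candidate means

# Inference view: evaluate the (log) likelihood everywhere, take the mode.
log_lik = np.array([-0.5 * np.sum((data - m) ** 2) for m in grid])
mle_inference = grid[np.argmax(log_lik)]

# Optimization view: exhaustive search over the same grid for the
# parameter minimizing the loss (the negative log-likelihood).
loss = -log_lik
mle_optimization = grid[np.argmin(loss)]

assert mle_inference == mle_optimization
print(mle_inference)   # close to the sample mean of `data`
```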
Exhaustive methods have exponential-order costs, so the first obvious large improvement is to approximate the full joint distribution with a factored graphical model, such as a factor graph. For real-valued variables, tracking full distributions is still quite expensive, so the next level of approximation/optimization is to use simple analytic distributions such as Gaussians. Another useful approximation is then to use some incremental sampling algorithm.
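A toy sketch of the factorization point (the chain of factors below is made up purely for illustration): exploiting the factored structure gives the same marginal as brute-force enumeration of the full joint, at a much lower cost.

```python
# A chain-structured joint p(a,b,c) ∝ f1(a,b) * f2(b,c). Marginalizing c
# by brute force costs O(K^3); exploiting the factor graph structure
# (sum-product message passing) costs O(K^2). Factors are random toy data.
import numpy as np

K = 4
rng = np.random.default_rng(1)
f1 = rng.random((K, K))   # factor over (a, b)
f2 = rng.random((K, K))   # factor over (b, c)

# Brute force: build the full joint table, then sum out a and b.
joint = np.einsum('ab,bc->abc', f1, f2)
marg_c_brute = joint.sum(axis=(0, 1))
marg_c_brute /= marg_c_brute.sum()

# Factored: message from f1 into b (sum over a), then push through f2.
msg_b = f1.sum(axis=0)
marg_c_factored = msg_b @ f2      # sum over b
marg_c_factored /= marg_c_factored.sum()

assert np.allclose(marg_c_brute, marg_c_factored)
```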
SGD-type algorithms are equivalent to approximate MLE inference in which only the mean of each variable is tracked, update messages are swept through the data in a simple fashion, and the variance is related to the learning rate.
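A rough sketch of what "only the mean of each variable is tracked" means in the simplest possible case (again a toy example of my own, not a claim about any particular system): SGD sweeps through the data keeping a single point estimate and lands near the exact MLE; a full inference method would also carry an uncertainty around that estimate, a role loosely played here by the learning rate schedule.

```python
# SGD on the Gaussian negative log-likelihood: a single running point
# estimate mu, swept through the data a few times, ends up near the
# exact MLE (the sample mean). Toy data invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=5000)

mu = 0.0
for epoch in range(5):                 # a few sweeps through the data
    lr = 0.1 / (1 + epoch)             # decaying learning rate
    for x in rng.permutation(data):
        grad = mu - x                  # d/dmu of 0.5 * (x - mu)^2
        mu -= lr * grad

print(mu, data.mean())                 # the two should be close
```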
I apologize if my comment came off as rude. I certainly didn't mean to assert any kind of authority over this. I am just a hobbyist, and some minor points you made bothered me. Mainly the comment about the limiting factor of NNs being the optimization algorithm they use, or that the brain uses something far better. The point about Bayesian inference was just tangential.
I didn't mean Bayesianism the philosophy, just Bayesian methods.
Bayesian inference is also just an “algorithm which is useful for fitting models to data.”
Yes, but it's not an optimization algorithm. Optimization algorithms are more general than statistics: you can use an optimization algorithm to find the optimal parameters for an airplane wing or the shortest path between several cities.
Conversely, Bayesian inference doesn't specify how the parameters should be optimized, just that you should somehow weigh every possibility according to its probability.
I am not saying that they aren't related at all, just that it's worth distinguishing them as qualitatively different concepts, whereas you seem to use them interchangeably.
…some minor points you made bothered me. Mainly the comment about the limiting factor of NNs being the optimization algorithm they use, or that the brain uses something far better.
I didn't say SGD is the main limiting factor of ANNs, or that the brain uses something far better. I said “the brain probably uses something even better than modern SGD ..”
Modern SGD methods, especially with automatic learning rate tuning and the new normalization schemes (which btw relate directly to better variance/uncertainty models in statistical inference methods), are pretty powerful, but they still learn somewhat slowly, requiring numerous passes through the data to reach a good solution.
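As a minimal sketch of the connection I'm gesturing at (an RMSProp-style update on a toy problem of my own, not anyone's production recipe): the per-parameter second-moment estimate that divides the step size is exactly the kind of variance/precision quantity a statistical inference method would track explicitly.

```python
# RMSProp-style adaptive step: keep a running second-moment estimate of
# each gradient component and divide the step by its square root, so
# badly scaled directions get automatically rescaled. Toy quadratic loss.
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.01, decay=0.9, eps=1e-8):
    v = decay * v + (1 - decay) * grad ** 2     # running second moment of the gradient
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Badly scaled quadratic loss: 0.5 * (100 * x0^2 + x1^2)
theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(500):
    grad = np.array([100.0 * theta[0], theta[1]])
    theta, v = rmsprop_step(theta, grad, v)

print(theta)   # both coordinates end up small despite the 100x scale gap
```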
I don't have time to dig deep into how the brain may use techniques better than SGD … but as a single simple example of something it does better: current SGD ANN training computes the same update steps at the same high cost across the entire network for every training example, even though examples vary vastly in their novelty/difficulty/utility for learning. The brain appears to be much better about managing its limited resources.
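A crude sketch of what I mean by that, purely as an illustration of gating the expensive update on how surprising an example is (this is not a claim about what the brain actually does, nor a standard training recipe):

```python
# Gate the (notionally expensive) update on how surprising the example is:
# a cheap forward pass first, and a skip when the example is already
# well predicted. Toy logistic regression on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20000, 10))
true_w = rng.normal(size=10)
y = (X @ true_w + 0.5 * rng.normal(size=20000) > 0).astype(float)

w = np.zeros(10)
updates = 0
for xi, yi in zip(X, y):
    logit = np.clip(xi @ w, -30.0, 30.0)        # cheap forward pass
    p = 1.0 / (1.0 + np.exp(-logit))
    loss = -(yi * np.log(p + 1e-12) + (1 - yi) * np.log(1 - p + 1e-12))
    if loss < 0.1:                              # already well predicted: skip the update
        continue
    w += 0.1 * (yi - p) * xi                    # full gradient update
    updates += 1

print(updates, "updates out of", len(y), "examples")
```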
I am not saying that [inference and optimization] aren't related at all, just that it's worth distinguishing them as qualitatively different concepts, whereas you seem to use them interchangeably.
They are largely interchangeable in machine learning in the sense that you can use optimization techniques (SGD) or inference techniques (expectation propagation, expectation backpropagation, MCMC, etc.) to train a model (such as an ANN).
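As a hedged sketch of that interchangeability (a toy 1-D regression of my own, not a recipe anyone uses in practice): the same model fit by SGD and by a random-walk Metropolis sampler over the unnormalized posterior lands in essentially the same place, with the sampler giving uncertainty estimates as a side effect.

```python
# The same toy model (1-D linear regression, y ≈ w * x) trained two ways:
# an optimization route (SGD on squared error) and an inference route
# (random-walk Metropolis over the unnormalized posterior of w).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = 2.0 * x + 0.3 * rng.normal(size=500)

# Optimization route: SGD on 0.5 * (w * x - y)^2.
w_sgd, lr = 0.0, 0.05
for _ in range(3):
    for i in rng.permutation(len(x)):
        w_sgd -= lr * (w_sgd * x[i] - y[i]) * x[i]

# Inference route: p(w | data) ∝ exp(-0.5 * sum((y - w*x)^2) / sigma^2)
# with a flat prior, sampled by random-walk Metropolis.
def log_post(w, sigma=0.3):
    return -0.5 * np.sum((y - w * x) ** 2) / sigma ** 2

w, samples = 0.0, []
for _ in range(5000):
    prop = w + 0.05 * rng.normal()
    if np.log(rng.random()) < log_post(prop) - log_post(w):
        w = prop
    samples.append(w)

w_mcmc = np.mean(samples[1000:])   # posterior mean after burn-in
print(w_sgd, w_mcmc)               # both land near the true slope of 2.0
```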
Much of the ‘wisdom’ or deep insightful knowledge in a particular field consists of learning all the structural relations and symmetries between different algorithms/techniques which enable internal mental compression of all of the raw low level knowledge: learning which techniques are generalizations, specializations, approximations, or restricted transformations of others. In the beginning, everything looks disconnected and compartmentalized, but eventually one sees how everything is connected.
General optimization can be used to implement inference, and vice versa. You can recast optimization as an inference problem: the initial settings/constraints become a prior, the utility/loss function is converted into a probability measure, learning rates relate to variance/precision, etc. See survey papers such as “Representation Learning”, or look into the use of Bayesian methods in machine learning (as replacements for optimization methods) to get some perspective on how they all relate.
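One compact way to write the mapping (my own notation, with T a temperature that the loss-to-probability conversion needs):

$$p(\theta \mid D) \;\propto\; \exp\!\big(-L(\theta; D)/T\big)\, p_0(\theta), \qquad \arg\max_\theta \, p(\theta \mid D) \;=\; \arg\min_\theta \big[\, L(\theta; D) \;-\; T \log p_0(\theta) \,\big]$$

where $p_0$ plays the role of the initialization/constraints, $L$ is the loss, and the MAP point of the induced posterior coincides with the regularized optimum.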